[2026-03-25 14:06:28,018][mllm.models.large_language_model_local][INFO] - Initializing adapter 'agent_adapter': no initial weights provided or found; starting from scratch.
[2026-03-25 14:06:29,733][mllm.models.adapter_training_wrapper][INFO] - Adapter 'agent_adapter': initialized with fresh weights (no initial weights found).
[2026-03-25 14:06:29,740][mllm.models.large_language_model_local][INFO] - Initializing adapter 'critic_adapter': no initial weights provided or found; starting from scratch.
[2026-03-25 14:06:34,938][mllm.models.adapter_training_wrapper][INFO] - Adapter 'critic_adapter': initialized with fresh weights (no initial weights found).
[2026-03-25 14:09:55,265][__main__][INFO] - Starting iteration 0.
[2026-03-25 14:09:55,271][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2026-03-25 14:09:55,272][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2026-03-25 14:10:03,823][__main__][INFO] - Number of regex retries in iteration 0: 0
[2026-03-25 14:10:03,825][__main__][INFO] - agents played in iteration 0 are Bob, Alice
[2026-03-25 14:10:04,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.42%, Block Peak % of device VRAM: 18.66%, ΔTime: 00:00:00
[2026-03-25 14:10:04,588][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.42%, Block Peak % of device VRAM: 18.66%, ΔTime: 00:00:00
[2026-03-25 14:10:04,588][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2026-03-25 14:10:04,589][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2026-03-25 14:10:05,234][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:10:06,447][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:10:07,105][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:10:07,760][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:10:08,416][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:10:09,071][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:10:09,726][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:10:10,383][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:10:11,038][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:10:11,695][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:10:12,351][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 14:10:13,008][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:10:13,665][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:10:14,323][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:10:14,980][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:10:15,636][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:10:16,294][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:10:16,950][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:10:17,608][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:10:18,264][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:10:18,923][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 256 [2026-03-25 14:10:19,581][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:10:20,237][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:10:20,894][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:10:21,552][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:10:22,208][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:10:22,864][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:10:23,522][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:10:24,179][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:10:24,837][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:10:25,494][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:10:26,151][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 14:10:26,808][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:10:27,464][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:10:28,121][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:10:28,779][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:10:29,435][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:10:30,093][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:10:30,749][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:10:31,405][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:10:32,062][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 
14:10:32,719][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:10:33,635][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:10:34,292][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:10:34,949][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:10:35,608][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:10:36,264][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:10:36,922][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:10:37,580][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:10:38,238][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:10:38,897][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:10:39,556][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 14:10:40,216][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:10:40,874][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:10:41,531][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:10:42,187][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:10:42,843][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:10:43,501][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:10:44,157][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:10:44,815][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:10:45,473][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:10:46,130][mllm.training.trainer_common][INFO] - 
Processing mini-batch 244 of 256 [2026-03-25 14:10:46,788][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:10:47,446][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:10:48,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 14:10:49,079][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 4.51%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.14%, ΔTime: 00:00:43 [2026-03-25 14:10:50,300][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:10:50,302][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:10:50,303][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:10:52,004][__main__][INFO] - Iteration 1 took 56s (15.08% Gen, 81.92% Train). Generation: 8s, Training: 46s. Estimated remaining time: 15h 40m 46s. Estimated total time: 15h 45m 33s. Time estimates for 10 more iterations: 9m 27s, 100 more iterations: 1h 34m 33s, 500 more iterations: 7h 52m 46s. [2026-03-25 14:10:52,006][__main__][INFO] - Starting iteration 1. [2026-03-25 14:10:52,010][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:10:52,011][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2026-03-25 14:10:58,132][__main__][INFO] - Number of regex retries in iteration 1: 0
[2026-03-25 14:10:58,134][__main__][INFO] - agents played in iteration 1 are Bob, Alice
[2026-03-25 14:10:58,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00
[2026-03-25 14:10:58,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00
[2026-03-25 14:10:58,842][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2026-03-25 14:10:58,843][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2026-03-25 14:10:59,709][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256
[2026-03-25 14:11:00,362][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256
[2026-03-25 14:11:01,017][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256
[2026-03-25 14:11:01,677][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256
[2026-03-25 14:11:02,336][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256
[2026-03-25 14:11:02,995][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256
[2026-03-25 14:11:03,655][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256
[2026-03-25 14:11:04,315][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256
[2026-03-25 14:11:04,973][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256
[2026-03-25 14:11:05,633][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256
[2026-03-25 14:11:06,291][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256
[2026-03-25 14:11:06,951][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:11:07,609][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:11:08,267][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:11:08,925][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:11:09,583][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:11:10,241][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:11:10,901][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:11:11,561][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:11:12,221][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:11:12,881][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:11:13,540][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:11:14,200][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:11:14,859][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:11:15,517][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:11:16,175][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:11:16,834][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:11:17,492][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:11:18,150][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:11:18,810][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:11:19,468][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:11:20,129][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:11:20,787][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:11:21,446][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:11:22,105][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:11:22,764][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:11:23,422][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:11:24,081][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:11:24,739][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:11:25,397][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:11:26,055][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:11:26,718][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:11:27,376][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:11:28,034][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:11:28,694][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:11:29,352][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:11:30,012][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:11:30,670][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:11:31,679][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:11:32,338][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:11:32,998][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:11:33,658][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:11:34,317][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:11:34,979][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:11:35,639][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:11:36,299][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:11:36,957][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:11:37,616][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:11:38,276][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:11:38,934][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:11:39,593][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:11:40,253][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:11:40,911][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:11:41,569][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:11:42,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:11:43,085][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43
[2026-03-25 14:11:44,416][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt
[2026-03-25 14:11:44,419][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt
[2026-03-25 14:11:44,420][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 14:11:45,994][__main__][INFO] - Iteration 2 took 53s (11.34% Gen, 85.74% Train). Generation: 6s, Training: 46s. Estimated remaining time: 14h 54m 4s. Estimated total time: 14h 59m 45s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 58s, 500 more iterations: 7h 29m 52s.
[2026-03-25 14:11:45,996][__main__][INFO] - Starting iteration 2.
[2026-03-25 14:11:46,000][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2026-03-25 14:11:46,000][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2026-03-25 14:11:52,738][__main__][INFO] - Number of regex retries in iteration 2: 0
[2026-03-25 14:11:52,739][__main__][INFO] - agents played in iteration 2 are Bob, Alice
[2026-03-25 14:11:53,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00
[2026-03-25 14:11:53,460][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00
[2026-03-25 14:11:53,461][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2026-03-25 14:11:53,462][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2026-03-25 14:11:54,264][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256
[2026-03-25 14:11:54,991][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256
[2026-03-25 14:11:55,635][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256
[2026-03-25 14:11:56,296][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256
[2026-03-25 14:11:56,956][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256
[2026-03-25 14:11:57,618][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256
[2026-03-25 14:11:58,279][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256
[2026-03-25 14:11:58,943][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256
[2026-03-25 14:11:59,605][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256
[2026-03-25 14:12:00,265][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256
[2026-03-25 14:12:00,926][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256
[2026-03-25 14:12:01,586][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:12:02,246][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:12:02,905][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:12:03,564][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:12:04,224][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:12:04,882][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:12:05,541][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:12:06,200][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:12:06,860][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:12:07,519][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:12:08,178][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:12:08,837][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:12:09,496][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:12:10,155][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:12:10,815][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:12:11,475][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:12:12,134][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:12:12,793][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:12:13,452][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:12:14,113][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:12:14,772][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:12:15,432][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:12:16,091][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:12:16,753][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:12:17,413][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:12:18,073][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:12:18,734][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:12:19,394][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:12:20,053][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:12:20,713][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:12:21,373][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:12:22,034][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:12:22,693][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:12:23,352][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:12:24,014][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:12:24,673][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:12:25,334][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:12:26,299][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:12:26,959][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:12:27,617][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:12:28,291][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:12:28,954][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:12:29,612][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:12:30,270][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:12:30,931][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:12:31,590][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:12:32,250][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:12:32,915][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:12:33,575][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:12:34,235][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:12:34,896][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:12:35,556][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:12:36,216][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:12:36,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:12:37,710][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43
[2026-03-25 14:12:39,069][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt
[2026-03-25 14:12:39,072][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt
[2026-03-25 14:12:39,073][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 14:12:40,474][__main__][INFO] - Iteration 3 took 54s (12.37% Gen, 85.06% Train). Generation: 6s, Training: 46s. Estimated remaining time: 15h 1m 20s. Estimated total time: 15h 7m 55s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 47s, 500 more iterations: 7h 33m 57s.
[2026-03-25 14:12:40,477][__main__][INFO] - Starting iteration 3.
[2026-03-25 14:12:40,482][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2026-03-25 14:12:40,483][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2026-03-25 14:12:46,462][__main__][INFO] - Number of regex retries in iteration 3: 0
[2026-03-25 14:12:46,463][__main__][INFO] - agents played in iteration 3 are Bob, Alice
[2026-03-25 14:12:47,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00
[2026-03-25 14:12:47,276][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00
[2026-03-25 14:12:47,277][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2026-03-25 14:12:47,277][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2026-03-25 14:12:48,055][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256
[2026-03-25 14:12:48,694][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256
[2026-03-25 14:12:49,346][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256
[2026-03-25 14:12:50,007][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256
[2026-03-25 14:12:50,669][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256
[2026-03-25 14:12:51,331][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256
[2026-03-25 14:12:51,992][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256
[2026-03-25 14:12:52,653][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256
[2026-03-25 14:12:53,312][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256
[2026-03-25 14:12:53,972][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256
[2026-03-25 14:12:54,631][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256
[2026-03-25 14:12:55,291][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:12:55,949][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:12:56,609][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:12:57,269][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:12:57,928][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:12:58,589][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:12:59,248][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:12:59,908][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:13:00,566][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:13:01,226][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:13:01,885][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:13:02,545][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:13:03,204][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:13:03,863][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:13:04,523][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:13:05,182][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:13:05,842][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:13:06,501][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:13:07,160][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:13:07,819][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:13:08,480][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:13:09,139][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:13:09,798][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:13:10,457][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:13:11,116][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:13:11,774][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:13:12,434][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:13:13,093][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:13:13,752][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:13:14,411][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:13:15,071][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:13:15,730][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:13:16,389][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:13:17,048][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:13:17,709][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:13:18,368][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:13:19,027][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:13:20,000][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:13:20,659][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:13:21,317][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:13:21,975][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:13:22,633][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:13:23,293][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:13:23,951][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:13:24,609][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:13:25,267][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:13:25,927][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:13:26,587][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:13:27,246][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:13:27,906][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:13:28,565][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:13:29,225][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:13:29,883][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:13:30,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:13:31,291][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:13:32,635][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:13:32,638][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:13:32,639][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:13:34,039][__main__][INFO] - Iteration 4 took 53s (11.16% Gen, 86.22% Train). Generation: 5s, Training: 46s. Estimated remaining time: 14h 45m 10s. Estimated total time: 14h 52m 38s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 15s, 500 more iterations: 7h 26m 19s. [2026-03-25 14:13:34,041][__main__][INFO] - Starting iteration 4. [2026-03-25 14:13:34,045][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:13:34,046][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:13:39,986][__main__][INFO] - Number of regex retries in iteration 4: 0 [2026-03-25 14:13:39,987][__main__][INFO] - agents played in iteration 4 are Bob, Alice [2026-03-25 14:13:40,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:13:40,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:13:40,648][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:13:40,649][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:13:41,414][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:13:42,049][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:13:42,699][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:13:43,361][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:13:44,022][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:13:44,683][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:13:45,342][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:13:46,001][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:13:46,662][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:13:47,321][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:13:47,981][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:13:48,641][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:13:49,301][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:13:49,960][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:13:50,620][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:13:51,280][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:13:51,938][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:13:52,603][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:13:53,264][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:13:53,923][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:13:54,582][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:13:55,241][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:13:55,900][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:13:56,561][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:13:57,220][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:13:57,878][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:13:58,538][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:13:59,197][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:13:59,856][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:14:00,515][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:14:01,176][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:14:01,836][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:14:02,495][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:14:03,154][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:14:03,813][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:14:04,472][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:14:05,131][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:14:05,791][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:14:06,449][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:14:07,108][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:14:07,768][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:14:08,426][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:14:09,085][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:14:09,745][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:14:10,404][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:14:11,063][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:14:11,723][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:14:12,382][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:14:13,355][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:14:14,014][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:14:14,674][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:14:15,332][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:14:15,990][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:14:16,650][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:14:17,310][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:14:17,969][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:14:18,630][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:14:19,289][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:14:19,949][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:14:20,608][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:14:21,270][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:14:21,929][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:14:22,588][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:14:23,248][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:14:23,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:14:24,774][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:14:26,105][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:14:26,109][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:14:26,110][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:14:27,467][__main__][INFO] - Iteration 5 took 53s (11.12% Gen, 86.33% Train). Generation: 5s, Training: 46s. Estimated remaining time: 14h 42m 1s. Estimated total time: 14h 50m 24s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 2s, 500 more iterations: 7h 25m 12s. [2026-03-25 14:14:27,470][__main__][INFO] - Starting iteration 5. [2026-03-25 14:14:27,475][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:14:27,476][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:14:33,732][__main__][INFO] - Number of regex retries in iteration 5: 0 [2026-03-25 14:14:33,733][__main__][INFO] - agents played in iteration 5 are Bob, Alice [2026-03-25 14:14:34,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:14:34,524][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:14:34,526][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:14:34,526][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:14:35,340][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:14:35,950][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:14:36,612][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:14:37,271][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:14:37,931][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:14:38,591][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:14:39,251][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:14:39,911][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:14:40,569][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:14:41,227][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:14:41,888][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:14:42,547][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:14:43,205][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:14:43,863][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:14:44,524][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:14:45,183][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:14:45,842][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:14:46,499][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:14:47,160][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:14:47,819][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:14:48,477][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:14:49,134][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:14:49,793][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:14:50,452][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:14:51,109][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:14:51,769][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:14:52,428][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:14:53,085][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:14:53,744][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:14:54,403][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:14:55,060][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:14:55,718][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:14:56,377][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:14:57,036][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:14:57,693][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:14:58,352][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:14:59,011][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:14:59,669][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:15:00,327][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:15:00,986][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:15:01,646][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:15:02,303][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:15:02,962][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:15:03,620][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:15:04,279][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:15:04,938][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:15:05,600][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:15:06,259][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:15:07,243][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:15:07,903][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:15:08,561][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:15:09,220][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:15:09,880][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:15:10,540][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:15:11,199][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:15:11,858][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:15:12,518][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:15:13,179][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:15:13,839][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:15:14,497][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:15:15,157][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:15:15,815][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:15:16,474][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:15:17,133][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:15:17,791][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:15:18,664][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:15:20,478][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:15:20,482][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:15:20,483][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:15:21,830][__main__][INFO] - Iteration 6 took 54s (11.51% Gen, 86.00% Train). Generation: 6s, Training: 46s. Estimated remaining time: 14h 56m 40s. Estimated total time: 15h 5m 57s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 35s, 500 more iterations: 7h 32m 58s. [2026-03-25 14:15:21,833][__main__][INFO] - Starting iteration 6. [2026-03-25 14:15:21,840][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:15:21,840][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:15:27,806][__main__][INFO] - Number of regex retries in iteration 6: 0 [2026-03-25 14:15:27,806][__main__][INFO] - agents played in iteration 6 are Bob, Alice [2026-03-25 14:15:28,385][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:15:28,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:15:28,458][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:15:28,459][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:15:29,205][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:15:29,818][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:15:30,477][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:15:31,136][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:15:31,794][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:15:32,453][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:15:33,112][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:15:33,773][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:15:34,435][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:15:35,095][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:15:35,755][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:15:36,414][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:15:37,072][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:15:37,731][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:15:38,390][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:15:39,050][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:15:39,708][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:15:40,366][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:15:41,025][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:15:41,684][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:15:42,343][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:15:43,001][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:15:43,662][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:15:44,321][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:15:44,980][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:15:45,641][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:15:46,301][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:15:46,961][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:15:47,620][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:15:48,279][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:15:48,938][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:15:49,597][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:15:50,255][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:15:50,914][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:15:51,572][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:15:52,230][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:15:52,891][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:15:53,550][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:15:54,208][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:15:54,866][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:15:55,524][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:15:56,183][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:15:56,846][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:15:57,502][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:15:58,162][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:15:58,821][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:15:59,481][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:16:00,140][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:16:01,147][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:16:01,806][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:16:02,464][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:16:03,123][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:16:03,782][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:16:04,441][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:16:05,099][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:16:05,758][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:16:06,416][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:16:07,075][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:16:07,735][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:16:08,399][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:16:09,060][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:16:09,718][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:16:10,379][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:16:11,038][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:16:11,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:16:12,502][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:16:13,825][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:16:13,828][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:16:13,830][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:16:15,279][__main__][INFO] - Iteration 7 took 53s (11.16% Gen, 86.12% Train). Generation: 5s, Training: 46s. Estimated remaining time: 14h 40m 32s. Estimated total time: 14h 50m 42s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 4s, 500 more iterations: 7h 25m 21s. [2026-03-25 14:16:15,282][__main__][INFO] - Starting iteration 7. [2026-03-25 14:16:15,286][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:16:15,287][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:16:22,109][__main__][INFO] - Number of regex retries in iteration 7: 0 [2026-03-25 14:16:22,110][__main__][INFO] - agents played in iteration 7 are Bob, Alice [2026-03-25 14:16:22,954][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:16:23,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:16:23,046][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:16:23,047][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:16:23,717][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:16:24,332][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:16:24,992][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:16:25,650][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:16:26,309][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:16:26,969][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:16:27,629][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:16:28,290][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:16:28,949][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:16:29,609][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:16:30,268][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:16:30,927][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:16:31,585][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:16:32,244][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:16:32,903][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:16:33,562][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:16:34,221][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:16:34,879][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:16:35,538][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:16:36,196][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:16:36,856][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:16:37,514][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:16:38,173][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:16:38,831][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:16:39,490][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:16:40,149][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:16:40,808][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:16:41,467][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:16:42,126][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:16:42,784][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:16:43,443][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:16:44,102][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:16:44,760][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:16:45,418][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:16:46,076][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:16:46,735][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:16:47,395][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:16:48,053][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:16:48,711][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:16:49,370][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:16:50,029][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:16:50,688][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:16:51,347][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:16:52,006][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:16:52,667][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:16:53,325][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:16:53,983][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:16:54,641][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:16:55,623][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:16:56,283][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:16:56,941][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:16:57,600][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:16:58,259][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:16:58,919][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:16:59,578][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:17:00,236][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:17:00,895][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:17:01,554][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:17:02,213][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:17:02,872][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:17:03,531][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:17:04,189][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:17:04,848][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:17:05,507][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:17:06,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:17:06,914][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:17:08,290][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:17:08,293][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:17:08,294][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:17:09,714][__main__][INFO] - Iteration 8 took 54s (12.54% Gen, 84.85% Train). Generation: 6s, Training: 46s. Estimated remaining time: 14h 56m 5s. Estimated total time: 15h 7m 9s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 42s, 500 more iterations: 7h 33m 34s. [2026-03-25 14:17:09,716][__main__][INFO] - Starting iteration 8. [2026-03-25 14:17:09,720][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:17:09,720][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:17:15,227][__main__][INFO] - Number of regex retries in iteration 8: 0 [2026-03-25 14:17:15,228][__main__][INFO] - agents played in iteration 8 are Bob, Alice [2026-03-25 14:17:15,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:17:15,999][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:17:15,999][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:17:16,000][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:17:16,897][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:17:17,523][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:17:18,186][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:17:18,847][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:17:19,507][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:17:20,167][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:17:20,827][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:17:21,486][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:17:22,149][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:17:22,809][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:17:23,468][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:17:24,128][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:17:24,787][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:17:25,447][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:17:26,107][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:17:26,767][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:17:27,426][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:17:28,084][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:17:28,744][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:17:29,404][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:17:30,065][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:17:30,725][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:17:31,384][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:17:32,043][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:17:32,702][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:17:33,361][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:17:34,020][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:17:34,680][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:17:35,339][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:17:35,997][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:17:36,656][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:17:37,314][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:17:37,973][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:17:38,632][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:17:39,292][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:17:39,951][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:17:40,611][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:17:41,269][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:17:41,929][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:17:42,589][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:17:43,248][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:17:43,908][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:17:44,567][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:17:45,226][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:17:45,885][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:17:46,545][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:17:47,204][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:17:47,863][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:17:48,846][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:17:49,505][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:17:50,164][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:17:50,823][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:17:51,483][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:17:52,144][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:17:52,803][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:17:53,463][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:17:54,123][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:17:54,781][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:17:55,440][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:17:56,098][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:17:56,755][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:17:57,414][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:17:58,073][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:17:58,733][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:17:59,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:18:00,188][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:18:01,995][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:18:01,997][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:18:01,998][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:18:03,341][__main__][INFO] - Iteration 9 took 53s (10.27% Gen, 87.22% Train). Generation: 5s, Training: 46s. Estimated remaining time: 14h 41m 45s. Estimated total time: 14h 53m 43s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 22s, 500 more iterations: 7h 26m 51s. [2026-03-25 14:18:03,344][__main__][INFO] - Starting iteration 9. [2026-03-25 14:18:03,348][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:18:03,348][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:18:10,416][__main__][INFO] - Number of regex retries in iteration 9: 0 [2026-03-25 14:18:10,417][__main__][INFO] - agents played in iteration 9 are Bob, Alice [2026-03-25 14:18:11,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:18:11,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:18:11,099][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:18:11,100][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:18:11,920][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:18:12,552][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:18:13,214][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:18:13,874][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:18:14,535][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:18:15,196][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:18:15,856][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:18:16,515][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:18:17,175][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:18:17,836][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:18:18,496][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:18:19,158][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:18:19,817][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:18:20,478][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:18:21,138][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:18:21,800][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:18:22,460][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:18:23,119][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:18:23,781][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:18:24,441][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:18:25,101][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:18:25,760][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:18:26,418][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:18:27,077][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:18:27,736][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:18:28,398][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:18:29,058][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:18:29,716][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:18:30,375][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:18:31,035][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:18:31,693][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:18:32,352][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:18:33,011][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:18:33,670][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:18:34,329][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:18:34,987][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:18:35,646][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:18:36,305][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:18:36,965][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:18:37,624][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:18:38,284][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:18:38,943][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:18:39,602][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:18:40,261][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:18:40,920][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:18:41,580][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:18:42,239][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:18:42,898][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:18:43,885][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:18:44,544][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:18:45,202][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:18:45,861][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:18:46,522][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:18:47,180][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:18:47,839][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:18:48,498][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:18:49,157][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:18:49,815][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:18:50,473][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:18:51,131][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:18:51,790][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:18:52,449][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:18:53,108][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:18:53,766][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:18:54,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:18:55,325][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:18:56,632][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:18:56,635][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:18:56,636][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:18:58,192][__main__][INFO] - Iteration 10 took 54s (12.89% Gen, 84.27% Train). Generation: 7s, Training: 46s. Estimated remaining time: 15h 1m 12s. Estimated total time: 15h 14m 5s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 24s, 500 more iterations: 7h 37m 2s. [2026-03-25 14:18:58,195][__main__][INFO] - Starting iteration 10. [2026-03-25 14:18:58,198][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:18:58,199][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:19:03,334][__main__][INFO] - Number of regex retries in iteration 10: 0 [2026-03-25 14:19:03,335][__main__][INFO] - agents played in iteration 10 are Bob, Alice [2026-03-25 14:19:03,853][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:19:03,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:19:03,919][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:19:03,920][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:19:04,580][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:19:05,194][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:19:05,859][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:19:06,519][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:19:07,179][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:19:07,838][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:19:08,496][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:19:09,155][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:19:09,813][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:19:10,472][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:19:11,134][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:19:11,791][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:19:12,450][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:19:13,109][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:19:13,766][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:19:14,424][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:19:15,085][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:19:15,744][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:19:16,405][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:19:17,062][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:19:17,722][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:19:18,382][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:19:19,041][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:19:19,699][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:19:20,359][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:19:21,017][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:19:21,674][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:19:22,332][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:19:22,991][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:19:23,649][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:19:24,307][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:19:24,966][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:19:25,624][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:19:26,282][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:19:26,939][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:19:27,597][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:19:28,257][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:19:28,915][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:19:29,572][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:19:30,230][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:19:30,888][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:19:31,545][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:19:32,204][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:19:32,861][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:19:33,520][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:19:34,180][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:19:34,839][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:19:35,498][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:19:36,478][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:19:37,137][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:19:37,796][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:19:38,453][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:19:39,112][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:19:39,770][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:19:40,428][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:19:41,086][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:19:41,745][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:19:42,403][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:19:43,061][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:19:43,720][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:19:44,378][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:19:45,036][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:19:45,696][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:19:46,355][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:19:47,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:19:47,720][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:19:49,030][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:19:49,033][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:19:49,034][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:19:50,476][__main__][INFO] - Iteration 11 took 52s (9.82% Gen, 87.41% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 17m 33s. Estimated total time: 14h 31m 19s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 7s, 500 more iterations: 7h 15m 39s. [2026-03-25 14:19:50,478][__main__][INFO] - Starting iteration 11. [2026-03-25 14:19:50,483][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:19:50,484][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:19:56,522][__main__][INFO] - Number of regex retries in iteration 11: 0 [2026-03-25 14:19:56,524][__main__][INFO] - agents played in iteration 11 are Bob, Alice [2026-03-25 14:19:57,724][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:19:57,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:19:57,792][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:19:57,792][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:19:58,457][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:19:59,064][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:19:59,724][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:20:00,384][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:20:01,043][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:20:01,702][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:20:02,361][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:20:03,020][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:20:03,677][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:20:04,337][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:20:04,994][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:20:05,653][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:20:06,311][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:20:06,969][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:20:07,628][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:20:08,287][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:20:08,944][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:20:09,603][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:20:10,261][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:20:10,922][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:20:11,581][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:20:12,240][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:20:12,899][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:20:13,558][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:20:14,217][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:20:14,877][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:20:15,534][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:20:16,193][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:20:16,850][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:20:17,508][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:20:18,166][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:20:18,823][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:20:19,481][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:20:20,140][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:20:20,803][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:20:21,464][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:20:22,121][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:20:22,780][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:20:23,442][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:20:24,101][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:20:24,759][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:20:25,418][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:20:26,076][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:20:26,735][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:20:27,393][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:20:28,051][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:20:28,710][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:20:29,371][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:20:30,352][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:20:31,012][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:20:31,672][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:20:32,332][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:20:32,990][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:20:33,648][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:20:34,307][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:20:34,966][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:20:35,624][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:20:36,282][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:20:36,940][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:20:37,599][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:20:38,256][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:20:38,914][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:20:39,572][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:20:40,230][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:20:40,889][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:20:41,642][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:20:43,045][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:20:43,048][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:20:43,049][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:20:44,612][__main__][INFO] - Iteration 12 took 54s (11.16% Gen, 85.95% Train). Generation: 6s, Training: 46s. Estimated remaining time: 14h 47m 32s. Estimated total time: 15h 2m 11s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 13s, 500 more iterations: 7h 31m 5s. [2026-03-25 14:20:44,614][__main__][INFO] - Starting iteration 12. [2026-03-25 14:20:44,619][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:20:44,620][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:20:50,198][__main__][INFO] - Number of regex retries in iteration 12: 0 [2026-03-25 14:20:50,199][__main__][INFO] - agents played in iteration 12 are Bob, Alice [2026-03-25 14:20:50,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:20:50,865][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:20:50,866][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:20:50,866][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:20:51,576][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:20:52,187][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:20:52,851][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:20:53,512][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:20:54,171][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:20:54,832][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:20:55,491][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:20:56,150][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:20:56,809][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:20:57,469][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:20:58,128][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:20:58,788][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:20:59,448][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:21:00,107][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:21:00,769][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:21:01,428][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:21:02,090][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:21:02,748][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:21:03,407][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:21:04,069][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:21:04,731][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:21:05,391][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:21:06,051][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:21:06,711][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:21:07,372][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:21:08,034][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:21:08,696][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:21:09,356][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:21:10,017][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:21:10,677][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:21:11,339][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:21:11,999][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:21:12,658][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:21:13,319][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:21:13,979][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:21:14,639][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:21:15,299][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:21:15,959][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:21:16,619][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:21:17,279][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:21:17,940][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:21:18,599][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:21:19,259][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:21:19,918][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:21:20,579][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:21:21,239][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:21:21,900][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:21:22,560][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:21:23,535][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:21:24,200][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:21:24,858][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:21:25,518][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:21:26,177][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:21:26,838][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:21:27,497][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:21:28,158][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:21:28,817][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:21:29,475][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:21:30,135][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:21:30,794][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:21:31,453][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:21:32,113][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:21:32,771][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:21:33,430][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:21:34,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:21:34,843][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:21:36,214][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:21:36,217][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:21:36,218][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:21:37,679][__main__][INFO] - Iteration 13 took 53s (10.52% Gen, 86.72% Train). Generation: 5s, Training: 46s. Estimated remaining time: 14h 28m 50s. Estimated total time: 14h 44m 23s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 26s, 500 more iterations: 7h 22m 11s. [2026-03-25 14:21:37,681][__main__][INFO] - Starting iteration 13. [2026-03-25 14:21:37,686][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:21:37,686][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:21:42,935][__main__][INFO] - Number of regex retries in iteration 13: 0 [2026-03-25 14:21:42,937][__main__][INFO] - agents played in iteration 13 are Bob, Alice [2026-03-25 14:21:43,562][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:21:43,626][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:21:43,626][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:21:43,627][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:21:44,281][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:21:44,901][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:21:45,562][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:21:46,223][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:21:46,885][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:21:47,546][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:21:48,207][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:21:48,867][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:21:49,526][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:21:50,185][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:21:50,846][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:21:51,506][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:21:52,165][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:21:52,824][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:21:53,484][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:21:54,142][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:21:54,801][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:21:55,460][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:21:56,120][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:21:56,779][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:21:57,438][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:21:58,096][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:21:58,756][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:21:59,416][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:22:00,078][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:22:00,738][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:22:01,399][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:22:02,061][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:22:02,720][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:22:03,381][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:22:04,039][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:22:04,698][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:22:05,356][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:22:06,014][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:22:06,674][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:22:07,333][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:22:07,993][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:22:08,658][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:22:09,319][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:22:09,978][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:22:10,641][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:22:11,301][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:22:11,960][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:22:12,620][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:22:13,279][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:22:13,939][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:22:14,600][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:22:15,260][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:22:16,237][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:22:16,898][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:22:17,559][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:22:18,219][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:22:18,876][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:22:19,535][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:22:20,193][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:22:20,854][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:22:21,514][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:22:22,173][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:22:22,835][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:22:23,495][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:22:24,154][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:22:24,812][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:22:25,472][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:22:26,131][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:22:26,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:22:27,553][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:22:29,015][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:22:29,019][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:22:29,020][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:22:30,428][__main__][INFO] - Iteration 14 took 52s (9.95% Gen, 87.37% Train). Generation: 5s, Training: 46s. Estimated remaining time: 14h 22m 39s. Estimated total time: 14h 39m 4s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 54s, 500 more iterations: 7h 19m 32s. [2026-03-25 14:22:30,430][__main__][INFO] - Starting iteration 14. [2026-03-25 14:22:30,435][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:22:30,435][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:22:35,622][__main__][INFO] - Number of regex retries in iteration 14: 0 [2026-03-25 14:22:35,622][__main__][INFO] - agents played in iteration 14 are Bob, Alice [2026-03-25 14:22:36,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:22:36,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:22:36,276][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:22:36,276][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:22:37,086][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:22:37,694][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:22:38,354][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:22:39,013][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:22:39,673][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:22:40,333][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:22:40,992][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:22:41,652][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:22:42,310][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:22:42,969][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:22:43,628][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:22:44,288][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:22:44,946][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:22:45,605][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:22:46,262][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:22:46,921][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:22:47,580][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:22:48,242][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:22:48,902][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:22:49,559][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:22:50,218][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:22:50,877][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:22:51,537][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:22:52,198][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:22:52,856][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:22:53,514][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:22:54,172][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:22:54,830][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:22:55,488][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:22:56,146][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:22:56,804][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:22:57,464][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:22:58,122][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:22:58,781][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:22:59,439][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:23:00,097][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:23:00,756][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:23:01,415][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:23:02,073][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:23:02,732][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:23:03,391][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:23:04,050][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:23:04,709][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:23:05,367][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:23:06,026][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:23:06,685][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:23:07,343][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:23:08,002][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:23:09,020][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:23:09,679][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:23:10,338][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:23:10,996][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:23:11,654][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:23:12,312][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:23:12,971][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:23:13,629][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:23:14,288][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:23:14,946][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:23:15,604][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:23:16,264][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:23:16,923][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:23:17,582][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:23:18,242][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:23:18,899][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:23:19,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:23:20,363][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:23:21,775][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:23:21,778][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:23:21,779][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:23:23,314][__main__][INFO] - Iteration 15 took 52s (9.81% Gen, 87.28% Train). Generation: 5s, Training: 46s. Estimated remaining time: 14h 24m 3s. Estimated total time: 14h 41m 21s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 8s, 500 more iterations: 7h 20m 40s. [2026-03-25 14:23:23,317][__main__][INFO] - Starting iteration 15. [2026-03-25 14:23:23,321][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:23:23,321][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:23:28,718][__main__][INFO] - Number of regex retries in iteration 15: 0 [2026-03-25 14:23:28,719][__main__][INFO] - agents played in iteration 15 are Bob, Alice [2026-03-25 14:23:29,294][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:23:29,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:23:29,358][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:23:29,359][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:23:30,313][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:23:30,928][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:23:31,590][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:23:32,249][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:23:32,908][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:23:33,568][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:23:34,229][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:23:34,888][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:23:35,547][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:23:36,208][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:23:36,867][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:23:37,527][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:23:38,187][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:23:38,845][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:23:39,504][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:23:40,163][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:23:40,822][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:23:41,481][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:23:42,141][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:23:42,803][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:23:43,463][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:23:44,125][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:23:44,785][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:23:45,445][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:23:46,104][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:23:46,764][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:23:47,423][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:23:48,081][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:23:48,741][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:23:49,401][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:23:50,060][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:23:50,721][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:23:51,380][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:23:52,039][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:23:52,699][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:23:53,358][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:23:54,016][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:23:54,675][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:23:55,334][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:23:55,992][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:23:56,652][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:23:57,311][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:23:57,969][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:23:58,629][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:23:59,288][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:23:59,947][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:24:00,608][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:24:01,268][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:24:02,278][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:24:02,937][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:24:03,595][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:24:04,254][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:24:04,912][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:24:05,570][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:24:06,228][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:24:06,886][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:24:07,544][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:24:08,203][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:24:08,861][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:24:09,519][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:24:10,178][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:24:10,836][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:24:11,495][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:24:12,154][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:24:12,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:24:13,569][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:24:14,868][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:24:14,870][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:24:14,871][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:24:16,310][__main__][INFO] - Iteration 16 took 52s (10.19% Gen, 87.09% Train). Generation: 5s, Training: 46s. Estimated remaining time: 14h 25m 0s. Estimated total time: 14h 43m 11s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 19s, 500 more iterations: 7h 21m 35s. [2026-03-25 14:24:16,312][__main__][INFO] - Starting iteration 16. [2026-03-25 14:24:16,316][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:24:16,317][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:24:21,882][__main__][INFO] - Number of regex retries in iteration 16: 0 [2026-03-25 14:24:21,883][__main__][INFO] - agents played in iteration 16 are Bob, Alice [2026-03-25 14:24:22,906][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:24:22,974][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:24:22,974][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:24:22,975][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:24:23,693][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:24:24,309][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:24:24,969][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:24:25,631][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:24:26,291][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:24:26,952][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:24:27,611][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:24:28,270][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:24:28,928][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:24:29,587][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:24:30,246][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:24:30,904][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:24:31,562][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:24:32,221][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:24:32,879][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:24:33,537][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:24:34,604][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:24:36,130][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:24:37,383][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:24:38,042][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:24:38,701][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:24:39,360][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:24:40,018][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:24:40,678][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:24:41,336][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:24:41,995][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:24:42,653][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:24:43,311][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:24:43,969][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:24:44,626][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:24:45,285][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:24:45,944][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:24:46,603][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:24:47,262][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:24:47,920][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:24:48,578][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:24:49,236][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:24:49,894][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:24:50,553][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:24:51,212][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:24:51,871][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:24:52,529][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:24:53,189][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:24:53,847][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:24:54,506][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:24:55,163][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:24:55,822][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:24:56,480][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:24:57,459][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:24:58,124][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:24:58,787][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:24:59,447][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:25:00,107][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:25:00,766][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:25:01,425][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:25:02,084][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:25:02,743][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:25:03,402][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:25:04,062][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:25:04,721][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:25:05,381][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:25:06,041][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:25:06,700][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:25:07,359][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:25:08,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:25:08,848][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:45 [2026-03-25 14:25:10,456][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:25:10,743][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:25:10,769][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:25:12,583][__main__][INFO] - Iteration 17 took 56s (9.89% Gen, 86.88% Train). Generation: 5s, Training: 48s. Estimated remaining time: 15h 18m 41s. Estimated total time: 15h 37m 48s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 46s, 500 more iterations: 7h 48m 54s. [2026-03-25 14:25:12,585][__main__][INFO] - Starting iteration 17. [2026-03-25 14:25:12,589][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:25:12,590][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:25:17,705][__main__][INFO] - Number of regex retries in iteration 17: 0 [2026-03-25 14:25:17,706][__main__][INFO] - agents played in iteration 17 are Bob, Alice [2026-03-25 14:25:18,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:25:18,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:25:18,381][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:25:18,381][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:25:19,048][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:25:19,666][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:25:20,326][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:25:20,988][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:25:21,650][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:25:22,311][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:25:22,971][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:25:23,632][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:25:24,292][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:25:24,952][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:25:25,612][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:25:26,271][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:25:26,932][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:25:27,592][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:25:28,251][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:25:28,912][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:25:29,571][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:25:30,231][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:25:30,890][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:25:31,549][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:25:32,208][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:25:32,868][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:25:33,528][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:25:34,187][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:25:34,846][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:25:35,505][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:25:36,166][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:25:36,825][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:25:37,484][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:25:38,143][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:25:38,807][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:25:39,469][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:25:40,129][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:25:40,787][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:25:41,448][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:25:42,108][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:25:42,769][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:25:43,429][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:25:44,091][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:25:44,751][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:25:45,410][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:25:46,069][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:25:46,730][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:25:47,389][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:25:48,049][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:25:48,709][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:25:49,368][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:25:50,028][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:25:51,012][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:25:51,671][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:25:52,330][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:25:52,988][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:25:53,646][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:25:54,307][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:25:54,966][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:25:55,625][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:25:56,284][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:25:56,942][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:25:57,600][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:25:58,259][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:25:58,918][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:25:59,578][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:26:00,237][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:26:00,898][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:26:01,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:26:02,333][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:26:03,688][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:26:03,691][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:26:03,692][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:26:05,030][__main__][INFO] - Iteration 18 took 52s (9.76% Gen, 87.69% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 14m 3s. Estimated total time: 14h 34m 2s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 24s, 500 more iterations: 7h 17m 1s. [2026-03-25 14:26:05,032][__main__][INFO] - Starting iteration 18. [2026-03-25 14:26:05,036][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:26:05,037][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:26:10,036][__main__][INFO] - Number of regex retries in iteration 18: 0 [2026-03-25 14:26:10,037][__main__][INFO] - agents played in iteration 18 are Bob, Alice [2026-03-25 14:26:10,503][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:26:10,567][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:26:10,568][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:26:10,569][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:26:11,375][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:26:11,988][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:26:12,649][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:26:13,310][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:26:13,970][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:26:14,629][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:26:15,290][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:26:15,949][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:26:16,609][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:26:17,268][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:26:17,929][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:26:18,589][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:26:19,251][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:26:19,910][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:26:20,570][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:26:21,230][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:26:21,889][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:26:22,550][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:26:23,209][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:26:23,870][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:26:24,529][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:26:25,189][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:26:25,848][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:26:26,507][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:26:27,166][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:26:27,825][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:26:28,486][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:26:29,148][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:26:29,808][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:26:30,469][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:26:31,130][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:26:31,790][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:26:32,450][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:26:33,110][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:26:33,768][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:26:34,428][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:26:35,086][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:26:35,746][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:26:36,405][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:26:37,064][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:26:37,724][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:26:38,382][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:26:39,041][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:26:39,700][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:26:40,359][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:26:41,018][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:26:41,677][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:26:42,336][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:26:43,321][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:26:43,981][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:26:44,639][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:26:45,298][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:26:45,958][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:26:46,618][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:26:47,276][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:26:47,934][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:26:48,592][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:26:49,251][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:26:49,909][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:26:50,568][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:26:51,227][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:26:51,885][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:26:52,544][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:26:53,202][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:26:53,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:26:54,624][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:26:55,971][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:26:55,974][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:26:55,975][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:26:57,275][__main__][INFO] - Iteration 19 took 52s (9.57% Gen, 87.93% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 9m 48s. Estimated total time: 14h 30m 40s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 4s, 500 more iterations: 7h 15m 20s. [2026-03-25 14:26:57,277][__main__][INFO] - Starting iteration 19. [2026-03-25 14:26:57,281][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:26:57,281][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:27:03,514][__main__][INFO] - Number of regex retries in iteration 19: 0 [2026-03-25 14:27:03,516][__main__][INFO] - agents played in iteration 19 are Bob, Alice [2026-03-25 14:27:04,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:27:04,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:27:04,664][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:27:04,665][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:27:05,416][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:27:06,036][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:27:06,698][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:27:07,359][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:27:08,020][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:27:08,680][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:27:09,340][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:27:10,000][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:27:10,661][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:27:11,321][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:27:11,995][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:27:12,652][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:27:13,312][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:27:13,973][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:27:14,632][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:27:15,294][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:27:15,955][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:27:16,615][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:27:17,274][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:27:17,935][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:27:18,595][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:27:19,254][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:27:19,914][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:27:20,574][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:27:21,233][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:27:21,893][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:27:22,553][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:27:23,214][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:27:23,873][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:27:24,532][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:27:25,191][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:27:25,850][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:27:26,509][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:27:27,167][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:27:27,827][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:27:28,488][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:27:29,148][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:27:29,808][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:27:30,469][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:27:31,129][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:27:31,790][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:27:32,449][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:27:33,107][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:27:33,767][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:27:34,426][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:27:35,085][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:27:35,744][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:27:36,403][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:27:37,382][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:27:38,042][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:27:38,707][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:27:39,365][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:27:40,025][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:27:40,684][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:27:41,344][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:27:42,002][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:27:42,661][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:27:43,320][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:27:43,981][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:27:44,640][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:27:45,300][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:27:45,960][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:27:46,619][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:27:47,278][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:27:47,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:27:48,713][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:27:50,106][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:27:50,109][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:27:50,110][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:27:51,604][__main__][INFO] - Iteration 20 took 54s (11.48% Gen, 85.77% Train). Generation: 6s, Training: 46s. Estimated remaining time: 14h 43m 38s. Estimated total time: 15h 5m 25s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 32s, 500 more iterations: 7h 32m 42s. [2026-03-25 14:27:51,606][__main__][INFO] - Starting iteration 20. [2026-03-25 14:27:51,611][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:27:51,612][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:27:57,109][__main__][INFO] - Number of regex retries in iteration 20: 0 [2026-03-25 14:27:57,110][__main__][INFO] - agents played in iteration 20 are Bob, Alice [2026-03-25 14:27:57,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:27:57,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:27:57,787][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:27:57,788][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:27:58,513][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:27:59,128][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:27:59,789][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:28:00,451][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:28:01,110][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:28:01,768][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:28:02,429][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:28:03,089][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:28:03,748][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:28:04,407][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:28:05,065][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:28:05,724][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:28:06,382][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:28:07,042][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:28:07,701][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:28:08,360][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:28:09,019][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:28:09,677][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:28:10,336][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:28:10,995][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:28:11,654][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:28:12,311][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:28:12,969][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:28:13,627][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:28:14,287][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:28:14,946][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:28:15,604][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:28:16,262][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:28:16,920][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:28:17,578][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:28:18,237][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:28:18,895][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:28:19,554][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:28:20,212][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:28:20,871][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:28:21,531][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:28:22,189][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:28:22,849][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:28:23,508][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:28:25,026][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:28:25,683][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:28:26,341][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:28:26,999][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:28:27,657][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:28:28,316][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:28:28,974][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:28:29,632][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:28:30,290][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:28:31,268][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:28:31,929][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:28:32,588][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:28:33,246][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:28:33,905][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:28:34,564][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:28:35,222][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:28:35,881][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:28:36,538][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:28:37,197][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:28:37,856][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:28:38,514][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:28:39,173][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:28:39,831][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:28:40,489][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:28:41,147][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:28:41,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:28:42,595][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 14:28:45,223][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:28:45,225][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:28:45,226][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:28:46,752][__main__][INFO] - Iteration 21 took 55s (9.97% Gen, 87.26% Train). Generation: 5s, Training: 48s. Estimated remaining time: 14h 56m 22s. Estimated total time: 15h 19m 3s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 54s, 500 more iterations: 7h 39m 31s. [2026-03-25 14:28:46,754][__main__][INFO] - Starting iteration 21. [2026-03-25 14:28:46,759][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:28:46,759][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:28:53,213][__main__][INFO] - Number of regex retries in iteration 21: 0 [2026-03-25 14:28:53,213][__main__][INFO] - agents played in iteration 21 are Bob, Alice [2026-03-25 14:28:53,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:28:53,862][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:28:53,863][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:28:53,863][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:28:54,715][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:28:55,332][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:28:55,992][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:28:56,655][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:28:57,316][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:28:57,974][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:28:58,636][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:28:59,297][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:28:59,956][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:29:00,614][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:29:01,274][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:29:01,933][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:29:02,593][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:29:03,251][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:29:03,909][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:29:04,570][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:29:05,228][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:29:05,886][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:29:06,545][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:29:07,204][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:29:07,864][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:29:08,523][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:29:09,182][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:29:09,843][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:29:10,503][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:29:11,162][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:29:11,820][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:29:12,479][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:29:13,139][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:29:13,797][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:29:14,455][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:29:15,114][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:29:15,773][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:29:16,432][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:29:17,090][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:29:17,749][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:29:18,407][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:29:19,065][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:29:19,725][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:29:20,384][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:29:21,042][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:29:21,701][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:29:22,359][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:29:23,017][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:29:23,676][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:29:24,334][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:29:24,992][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:29:25,651][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:29:26,632][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:29:27,293][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:29:27,952][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:29:28,610][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:29:29,269][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:29:29,928][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:29:30,586][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:29:31,244][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:29:33,542][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:29:34,804][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:29:35,464][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:29:36,124][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:29:36,783][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:29:37,440][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:29:38,098][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:29:38,761][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:29:39,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:29:40,203][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:45 [2026-03-25 14:29:41,564][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:29:41,566][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:29:41,567][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:29:43,016][__main__][INFO] - Iteration 22 took 56s (11.47% Gen, 85.95% Train). Generation: 6s, Training: 48s. Estimated remaining time: 15h 14m 1s. Estimated total time: 15h 37m 39s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 45s, 500 more iterations: 7h 48m 49s. [2026-03-25 14:29:43,018][__main__][INFO] - Starting iteration 22. [2026-03-25 14:29:43,023][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:29:43,024][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:29:48,993][__main__][INFO] - Number of regex retries in iteration 22: 0 [2026-03-25 14:29:48,995][__main__][INFO] - agents played in iteration 22 are Bob, Alice [2026-03-25 14:29:50,054][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:29:50,120][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:29:50,120][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:29:50,121][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:29:50,828][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:29:51,456][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:29:52,118][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:29:52,777][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:29:53,435][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:29:54,094][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:29:54,753][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:29:55,412][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:29:56,071][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:29:56,730][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:29:57,389][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:29:58,049][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:29:58,713][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:29:59,370][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:30:00,030][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:30:00,690][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:30:01,348][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:30:02,007][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:30:02,665][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:30:03,323][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:30:03,982][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:30:04,641][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:30:05,301][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:30:05,961][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:30:06,619][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:30:07,278][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:30:07,937][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:30:08,598][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:30:09,257][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:30:09,915][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:30:10,575][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:30:11,234][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:30:11,892][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:30:12,551][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:30:13,210][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:30:13,871][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:30:14,530][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:30:15,188][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:30:15,847][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:30:16,505][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:30:17,164][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:30:17,822][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:30:18,480][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:30:19,141][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:30:19,798][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:30:20,457][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:30:21,115][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:30:21,774][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:30:22,753][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:30:23,414][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:30:24,072][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:30:24,730][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:30:25,389][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:30:26,049][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:30:26,707][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:30:27,366][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:30:28,024][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:30:28,684][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:30:29,346][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:30:30,004][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:30:30,663][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:30:31,321][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:30:31,980][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:30:32,637][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:30:33,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:30:34,220][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:30:35,274][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:30:35,276][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:30:35,277][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:30:36,814][__main__][INFO] - Iteration 23 took 53s (11.10% Gen, 86.04% Train). Generation: 5s, Training: 46s. Estimated remaining time: 14h 32m 1s. Estimated total time: 14h 56m 33s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 39s, 500 more iterations: 7h 28m 16s. [2026-03-25 14:30:36,817][__main__][INFO] - Starting iteration 23. [2026-03-25 14:30:36,820][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:30:36,821][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:30:41,874][__main__][INFO] - Number of regex retries in iteration 23: 0 [2026-03-25 14:30:41,875][__main__][INFO] - agents played in iteration 23 are Bob, Alice [2026-03-25 14:30:42,382][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:30:42,449][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:30:42,449][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:30:42,450][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:30:43,338][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:30:43,957][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:30:44,620][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:30:45,280][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:30:45,943][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:30:46,602][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:30:47,263][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:30:47,922][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:30:48,581][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:30:49,241][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:30:49,904][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:30:50,563][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:30:51,223][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:30:51,884][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:30:52,545][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:30:53,205][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:30:53,865][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:30:54,525][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:30:55,185][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:30:55,846][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:30:56,505][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:30:57,164][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:30:57,823][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:30:58,484][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:30:59,143][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:30:59,802][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:31:00,461][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:31:01,121][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:31:01,783][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:31:02,443][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:31:03,103][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:31:03,762][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:31:04,421][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:31:05,080][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:31:05,739][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:31:06,398][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:31:07,058][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:31:07,719][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:31:08,380][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:31:09,040][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:31:09,701][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:31:10,362][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:31:11,022][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:31:11,681][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:31:12,341][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:31:13,000][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:31:13,660][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:31:14,322][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:31:15,312][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:31:15,970][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:31:16,629][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:31:17,288][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:31:17,947][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:31:18,610][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:31:19,267][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:31:19,930][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:31:20,588][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:31:21,249][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:31:21,911][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:31:22,570][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:31:23,232][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:31:23,890][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:31:24,551][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:31:25,211][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:31:25,868][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:31:26,740][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:31:28,057][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:31:28,060][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:31:28,061][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:31:29,626][__main__][INFO] - Iteration 24 took 52s (9.57% Gen, 87.46% Train). Generation: 5s, Training: 46s. Estimated remaining time: 14h 14m 43s. Estimated total time: 14h 40m 7s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 0s, 500 more iterations: 7h 20m 3s. [2026-03-25 14:31:29,629][__main__][INFO] - Starting iteration 24. [2026-03-25 14:31:29,633][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:31:29,634][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:31:34,703][__main__][INFO] - Number of regex retries in iteration 24: 0 [2026-03-25 14:31:34,704][__main__][INFO] - agents played in iteration 24 are Bob, Alice [2026-03-25 14:31:35,286][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:31:35,352][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:31:35,353][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:31:35,353][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:31:36,009][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:31:36,627][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:31:37,288][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:31:37,949][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:31:38,609][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:31:39,270][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:31:39,931][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:31:40,591][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:31:41,250][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:31:41,911][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:31:42,570][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:31:43,231][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:31:43,891][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:31:44,550][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:31:45,211][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:31:45,870][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:31:46,529][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:31:47,187][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:31:47,847][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:31:48,507][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:31:49,166][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:31:49,824][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:31:50,485][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:31:51,145][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:31:51,804][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:31:52,464][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:31:53,123][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:31:53,783][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:31:54,443][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:31:55,102][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:31:55,762][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:31:56,422][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:31:57,082][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:31:57,741][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:31:58,401][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:31:59,061][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:31:59,721][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:32:00,380][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:32:01,039][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:32:01,697][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:32:02,356][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:32:03,015][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:32:03,674][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:32:04,333][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:32:04,992][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:32:05,650][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:32:06,309][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:32:06,968][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:32:07,947][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:32:08,609][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:32:09,269][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:32:09,927][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:32:10,586][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:32:11,245][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:32:11,904][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:32:12,562][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:32:13,221][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:32:13,879][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:32:14,539][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:32:15,199][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:32:15,857][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:32:16,514][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:32:17,173][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:32:17,831][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:32:18,491][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:32:19,269][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:32:21,033][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:32:21,035][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:32:21,036][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:32:22,443][__main__][INFO] - Iteration 25 took 52s (9.60% Gen, 87.73% Train). Generation: 5s, Training: 46s. Estimated remaining time: 14h 13m 55s. Estimated total time: 14h 40m 12s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 1s, 500 more iterations: 7h 20m 6s. [2026-03-25 14:32:22,447][__main__][INFO] - Starting iteration 25. [2026-03-25 14:32:22,451][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:32:22,452][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:32:27,354][__main__][INFO] - Number of regex retries in iteration 25: 0 [2026-03-25 14:32:27,354][__main__][INFO] - agents played in iteration 25 are Bob, Alice [2026-03-25 14:32:27,925][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:32:27,994][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:32:27,994][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:32:27,995][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:32:28,651][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:32:29,277][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:32:29,937][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:32:30,595][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:32:31,256][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:32:31,917][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:32:32,577][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:32:33,237][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:32:33,900][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:32:34,560][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:32:35,220][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:32:35,880][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:32:36,540][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:32:37,200][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:32:37,859][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:32:38,518][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:32:39,178][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:32:39,838][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:32:40,497][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:32:41,157][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:32:41,816][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:32:42,474][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:32:43,134][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:32:43,793][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:32:44,452][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:32:45,111][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:32:45,770][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:32:46,429][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:32:47,087][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:32:47,746][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:32:48,406][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:32:49,065][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:32:49,724][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:32:50,383][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:32:51,041][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:32:51,700][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:32:52,358][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:32:53,017][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:32:53,677][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:32:54,336][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:32:54,994][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:32:55,654][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:32:56,313][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:32:56,972][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:32:57,631][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:32:58,291][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:32:58,950][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:32:59,609][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:33:00,591][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:33:01,250][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:33:01,909][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:33:02,568][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:33:03,226][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:33:03,885][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:33:04,543][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:33:05,201][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:33:05,861][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:33:06,519][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:33:07,178][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:33:07,836][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:33:08,494][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:33:09,152][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:33:09,812][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:33:10,471][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:33:11,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:33:12,019][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:33:13,330][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:33:13,332][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:33:13,333][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:33:14,783][__main__][INFO] - Iteration 26 took 52s (9.37% Gen, 87.86% Train). Generation: 4s, Training: 45s. Estimated remaining time: 14h 5m 4s. Estimated total time: 14h 32m 13s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 13s, 500 more iterations: 7h 16m 6s. [2026-03-25 14:33:14,785][__main__][INFO] - Starting iteration 26. [2026-03-25 14:33:14,789][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:33:14,789][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:33:20,677][__main__][INFO] - Number of regex retries in iteration 26: 0 [2026-03-25 14:33:20,679][__main__][INFO] - agents played in iteration 26 are Bob, Alice [2026-03-25 14:33:21,514][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:33:21,579][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:33:21,580][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:33:21,580][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:33:22,217][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:33:22,820][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:33:23,484][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:33:24,144][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:33:24,806][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:33:25,463][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:33:26,125][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:33:26,785][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:33:27,444][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:33:28,103][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:33:28,764][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:33:29,424][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:33:30,083][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:33:30,745][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:33:31,402][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:33:32,060][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:33:32,718][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:33:33,377][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:33:34,037][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:33:34,695][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:33:35,353][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:33:36,011][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:33:36,670][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:33:37,327][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:33:37,986][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:33:38,645][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:33:39,305][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:33:39,963][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:33:40,621][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:33:41,279][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:33:41,938][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:33:42,596][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:33:43,253][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:33:43,912][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:33:44,571][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:33:45,229][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:33:45,887][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:33:46,545][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:33:47,203][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:33:47,861][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:33:48,519][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:33:49,176][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:33:49,834][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:33:50,493][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:33:51,152][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:33:51,810][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:33:52,469][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:33:53,127][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:33:54,118][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:33:54,776][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:33:55,434][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:33:56,093][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:33:56,752][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:33:57,410][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:33:58,068][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:33:58,731][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:33:59,389][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:34:00,049][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:34:00,708][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:34:01,368][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:34:02,025][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:34:02,687][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:34:03,345][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:34:04,002][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:34:04,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:34:05,384][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:34:06,729][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:34:06,733][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:34:06,734][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:34:08,254][__main__][INFO] - Iteration 27 took 53s (11.02% Gen, 86.14% Train). Generation: 5s, Training: 46s. Estimated remaining time: 14h 23m 4s. Estimated total time: 14h 51m 7s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 6s, 500 more iterations: 7h 25m 33s. [2026-03-25 14:34:08,258][__main__][INFO] - Starting iteration 27. [2026-03-25 14:34:08,263][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:34:08,263][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:34:14,496][__main__][INFO] - Number of regex retries in iteration 27: 0 [2026-03-25 14:34:14,496][__main__][INFO] - agents played in iteration 27 are Bob, Alice [2026-03-25 14:34:15,501][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:34:15,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:34:15,567][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:34:15,567][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:34:16,255][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:34:16,874][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:34:17,532][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:34:18,190][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:34:18,851][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:34:19,508][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:34:20,166][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:34:20,824][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:34:21,483][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:34:22,141][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:34:22,799][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:34:23,457][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:34:24,116][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:34:24,775][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:34:25,433][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:34:26,091][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:34:26,749][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:34:27,407][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:34:28,065][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:34:28,724][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:34:29,382][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:34:30,040][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:34:30,701][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:34:31,360][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:34:32,018][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:34:32,677][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:34:33,335][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:34:33,992][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:34:34,650][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:34:35,308][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:34:35,967][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:34:36,629][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:34:37,287][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:34:37,944][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:34:38,604][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:34:39,259][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:34:39,917][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:34:40,577][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:34:41,235][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:34:41,893][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:34:42,552][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:34:43,210][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:34:43,868][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:34:44,526][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:34:45,185][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:34:45,843][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:34:46,502][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:34:47,160][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:34:48,171][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:34:48,830][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:34:49,488][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:34:50,149][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:34:50,807][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:34:51,467][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:34:52,126][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:34:52,784][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:34:53,442][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:34:54,102][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:34:54,762][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:34:55,419][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:34:56,078][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:34:56,735][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:34:57,393][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:34:58,052][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:34:58,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:34:59,621][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:35:01,481][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:35:01,484][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:35:01,485][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:35:02,917][__main__][INFO] - Iteration 28 took 54s (11.40% Gen, 85.97% Train). Generation: 6s, Training: 46s. Estimated remaining time: 14h 41m 59s. Estimated total time: 15h 10m 56s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 5s, 500 more iterations: 7h 35m 28s. [2026-03-25 14:35:02,919][__main__][INFO] - Starting iteration 28. [2026-03-25 14:35:02,924][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:35:02,924][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:35:07,979][__main__][INFO] - Number of regex retries in iteration 28: 0 [2026-03-25 14:35:07,980][__main__][INFO] - agents played in iteration 28 are Bob, Alice [2026-03-25 14:35:08,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:35:08,533][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:35:08,534][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:35:08,534][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:35:09,198][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:35:09,810][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:35:10,471][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:35:11,131][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:35:11,790][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:35:12,450][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:35:13,108][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:35:13,767][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:35:14,425][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:35:15,084][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:35:15,742][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:35:16,401][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:35:17,060][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:35:17,720][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:35:18,382][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:35:19,039][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:35:19,698][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:35:20,356][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:35:21,015][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:35:21,673][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:35:22,333][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:35:22,991][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:35:23,650][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:35:24,308][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:35:24,967][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:35:25,627][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:35:26,286][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:35:26,944][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:35:27,602][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:35:28,260][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:35:28,919][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:35:29,577][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:35:30,236][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:35:30,896][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:35:31,554][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:35:32,212][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:35:32,870][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:35:33,529][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:35:34,189][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:35:34,847][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:35:35,506][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:35:36,165][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:35:36,824][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:35:37,483][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:35:38,141][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:35:38,800][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:35:39,458][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:35:40,120][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:35:41,127][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:35:41,787][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:35:42,446][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:35:43,105][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:35:43,766][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:35:44,425][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:35:45,086][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:35:45,747][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:35:46,403][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:35:47,061][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:35:47,720][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:35:48,378][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:35:49,039][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:35:49,698][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:35:50,357][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:35:51,016][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:35:51,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:35:52,491][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:35:53,844][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:35:53,847][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:35:53,849][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:35:55,329][__main__][INFO] - Iteration 29 took 52s (9.65% Gen, 87.52% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 3m 37s. Estimated total time: 14h 33m 27s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 20s, 500 more iterations: 7h 16m 43s. [2026-03-25 14:35:55,331][__main__][INFO] - Starting iteration 29. [2026-03-25 14:35:55,335][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:35:55,335][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:36:00,572][__main__][INFO] - Number of regex retries in iteration 29: 0 [2026-03-25 14:36:00,572][__main__][INFO] - agents played in iteration 29 are Bob, Alice [2026-03-25 14:36:01,152][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:36:01,219][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:36:01,220][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:36:01,220][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:36:01,897][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:36:02,520][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:36:03,172][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:36:03,831][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:36:04,491][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:36:05,149][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:36:05,807][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:36:06,466][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:36:07,125][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:36:07,785][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:36:08,444][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:36:09,103][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:36:09,761][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:36:10,421][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:36:11,079][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:36:11,737][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:36:12,397][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:36:13,056][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:36:13,714][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:36:14,372][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:36:15,030][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:36:15,687][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:36:16,346][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:36:17,005][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:36:17,663][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:36:18,323][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:36:18,982][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:36:19,641][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:36:20,299][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:36:20,957][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:36:21,616][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:36:22,274][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:36:22,932][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:36:23,590][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:36:24,248][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:36:24,906][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:36:25,564][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:36:26,223][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:36:26,882][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:36:27,540][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:36:28,200][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:36:28,861][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:36:29,519][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:36:30,177][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:36:30,835][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:36:31,493][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:36:32,151][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:36:32,809][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:36:33,785][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:36:34,446][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:36:35,106][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:36:35,763][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:36:36,422][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:36:37,082][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:36:37,740][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:36:38,400][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:36:39,055][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:36:39,717][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:36:40,380][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:36:41,038][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:36:41,696][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:36:42,354][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:36:43,013][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:36:43,673][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:36:44,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:36:45,125][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:36:46,908][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:36:46,911][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:36:46,912][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:36:48,337][__main__][INFO] - Iteration 30 took 53s (9.88% Gen, 87.43% Train). Generation: 5s, Training: 46s. Estimated remaining time: 14h 12m 40s. Estimated total time: 14h 43m 23s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 20s, 500 more iterations: 7h 21m 41s. [2026-03-25 14:36:48,339][__main__][INFO] - Starting iteration 30. [2026-03-25 14:36:48,343][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:36:48,344][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:36:54,571][__main__][INFO] - Number of regex retries in iteration 30: 0 [2026-03-25 14:36:54,572][__main__][INFO] - agents played in iteration 30 are Bob, Alice [2026-03-25 14:36:55,417][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:36:55,482][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:36:55,483][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:36:55,483][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:36:56,139][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:36:56,767][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:36:57,426][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:36:58,083][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:36:58,743][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:36:59,404][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:37:00,062][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:37:00,724][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:37:01,383][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:37:02,043][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:37:02,702][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:37:03,362][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:37:04,019][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:37:04,680][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:37:05,338][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:37:05,996][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:37:06,656][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:37:07,314][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:37:07,972][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:37:08,630][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:37:09,291][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:37:09,951][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:37:10,610][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:37:11,269][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:37:11,928][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:37:12,588][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:37:13,248][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:37:13,907][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:37:14,565][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:37:15,224][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:37:15,884][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:37:16,542][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:37:17,201][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:37:17,860][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:37:18,521][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:37:19,180][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:37:19,838][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:37:20,497][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:37:21,155][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:37:21,814][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:37:22,472][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:37:23,130][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:37:23,789][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:37:24,447][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:37:25,107][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:37:25,767][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:37:26,425][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:37:27,083][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:37:28,069][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:37:28,730][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:37:29,388][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:37:30,048][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:37:30,707][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:37:31,365][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:37:32,024][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:37:32,682][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:37:33,340][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:37:33,998][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:37:34,657][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:37:35,316][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:37:35,974][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:37:36,632][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:37:37,290][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:37:37,948][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:37:38,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:37:39,465][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:37:40,896][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:37:40,898][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:37:40,899][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:37:42,255][__main__][INFO] - Iteration 31 took 53s (11.55% Gen, 85.93% Train). Generation: 6s, Training: 46s. Estimated remaining time: 14h 26m 57s. Estimated total time: 14h 58m 33s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 51s, 500 more iterations: 7h 29m 16s. [2026-03-25 14:37:42,258][__main__][INFO] - Starting iteration 31. [2026-03-25 14:37:42,262][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:37:42,262][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:37:48,082][__main__][INFO] - Number of regex retries in iteration 31: 0 [2026-03-25 14:37:48,083][__main__][INFO] - agents played in iteration 31 are Bob, Alice [2026-03-25 14:37:49,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:37:49,122][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:37:49,123][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:37:49,123][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:37:49,951][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:37:50,570][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:37:51,229][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:37:51,887][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:37:52,545][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:37:53,202][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:37:53,863][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:37:54,523][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:37:55,183][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:37:55,843][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:37:56,505][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:37:57,164][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:37:57,824][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:37:58,483][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:37:59,140][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:37:59,798][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:38:00,456][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:38:01,116][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:38:01,774][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:38:02,432][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:38:03,090][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:38:03,748][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:38:04,407][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:38:05,065][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:38:05,724][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:38:06,382][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:38:07,041][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:38:07,698][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:38:08,356][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:38:09,014][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:38:09,673][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:38:10,330][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:38:10,989][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:38:11,649][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:38:12,306][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:38:12,964][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:38:13,622][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:38:14,280][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:38:14,937][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:38:15,595][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:38:16,253][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:38:16,911][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:38:17,569][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:38:18,228][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:38:18,886][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:38:19,544][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:38:20,202][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:38:20,861][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:38:21,855][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:38:22,514][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:38:23,173][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:38:23,832][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:38:24,491][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:38:25,150][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:38:25,809][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:38:26,469][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:38:27,129][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:38:27,788][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:38:28,451][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:38:29,106][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:38:29,765][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:38:30,425][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:38:31,083][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:38:31,742][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:38:32,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:38:33,112][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:38:34,653][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:38:34,656][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:38:34,657][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:38:35,980][__main__][INFO] - Iteration 32 took 53s (10.84% Gen, 86.70% Train). Generation: 5s, Training: 46s. Estimated remaining time: 14h 22m 49s. Estimated total time: 14h 55m 19s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 31s, 500 more iterations: 7h 27m 39s. [2026-03-25 14:38:35,982][__main__][INFO] - Starting iteration 32. [2026-03-25 14:38:35,986][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:38:35,986][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:38:42,143][__main__][INFO] - Number of regex retries in iteration 32: 0 [2026-03-25 14:38:42,144][__main__][INFO] - agents played in iteration 32 are Bob, Alice [2026-03-25 14:38:42,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:38:42,708][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:38:42,708][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:38:42,709][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:38:43,535][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:38:44,145][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:38:44,806][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:38:45,465][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:38:46,124][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:38:46,783][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:38:47,442][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:38:48,101][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:38:48,760][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:38:49,420][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:38:50,078][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:38:50,737][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:38:51,397][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:38:52,057][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:38:52,716][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:38:53,376][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:38:54,036][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:38:54,696][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:38:55,359][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:38:56,018][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:38:56,677][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:38:57,335][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:38:57,994][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:38:58,657][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:38:59,315][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:38:59,973][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:39:00,633][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:39:01,291][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:39:01,949][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:39:02,610][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:39:03,269][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:39:03,927][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:39:04,584][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:39:05,242][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:39:05,900][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:39:06,558][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:39:07,216][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:39:07,874][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:39:08,532][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:39:09,189][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:39:09,847][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:39:10,508][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:39:11,168][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:39:11,827][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:39:12,485][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:39:13,143][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:39:13,801][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:39:14,458][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:39:15,439][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:39:16,098][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:39:16,757][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:39:17,416][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:39:18,074][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:39:18,732][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:39:19,392][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:39:20,051][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:39:20,709][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:39:21,366][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:39:22,024][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:39:22,683][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:39:23,341][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:39:23,999][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:39:24,661][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:39:25,320][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:39:25,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:39:26,728][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:39:28,066][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:39:28,069][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:39:28,070][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:39:29,565][__main__][INFO] - Iteration 33 took 53s (11.49% Gen, 85.71% Train). Generation: 6s, Training: 45s. Estimated remaining time: 14h 19m 37s. Estimated total time: 14h 53m 1s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 18s, 500 more iterations: 7h 26m 30s. [2026-03-25 14:39:29,567][__main__][INFO] - Starting iteration 33. [2026-03-25 14:39:29,572][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:39:29,573][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:39:34,693][__main__][INFO] - Number of regex retries in iteration 33: 0 [2026-03-25 14:39:34,694][__main__][INFO] - agents played in iteration 33 are Bob, Alice [2026-03-25 14:39:35,155][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:39:35,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:39:35,223][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:39:35,224][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:39:35,934][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:39:36,549][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:39:37,209][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:39:37,869][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:39:38,528][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:39:39,189][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:39:39,849][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:39:40,508][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:39:41,167][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:39:41,825][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:39:42,484][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:39:43,143][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:39:43,801][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:39:44,459][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:39:45,117][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:39:45,776][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:39:46,434][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:39:47,094][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:39:47,752][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:39:48,414][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:39:49,072][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:39:49,729][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:39:50,388][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:39:51,048][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:39:51,709][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:39:52,367][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:39:53,027][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:39:53,687][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:39:54,345][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:39:55,003][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:39:55,661][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:39:56,320][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:39:56,978][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:39:57,636][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:39:58,295][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:39:58,955][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:39:59,613][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:40:00,272][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:40:00,930][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:40:01,587][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:40:02,246][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:40:02,905][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:40:03,564][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:40:04,221][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:40:04,879][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:40:05,537][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:40:06,196][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:40:06,854][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:40:07,836][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:40:08,495][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:40:09,153][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:40:09,811][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:40:10,469][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:40:11,127][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:40:11,787][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:40:12,447][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:40:13,106][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:40:13,763][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:40:14,422][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:40:15,079][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:40:15,737][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:40:16,395][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:40:17,053][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:40:17,711][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:40:18,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:40:19,146][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:40:20,624][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:40:20,627][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:40:20,628][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:40:22,037][__main__][INFO] - Iteration 34 took 52s (9.76% Gen, 87.55% Train). Generation: 5s, Training: 45s. Estimated remaining time: 14h 0m 11s. Estimated total time: 14h 34m 28s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 26s, 500 more iterations: 7h 17m 14s. [2026-03-25 14:40:22,039][__main__][INFO] - Starting iteration 34. [2026-03-25 14:40:22,044][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:40:22,045][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:40:27,370][__main__][INFO] - Number of regex retries in iteration 34: 0 [2026-03-25 14:40:27,371][__main__][INFO] - agents played in iteration 34 are Bob, Alice [2026-03-25 14:40:27,930][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:40:27,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:40:27,995][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:40:27,996][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:40:28,885][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:40:29,501][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:40:30,164][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:40:30,823][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:40:31,482][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:40:32,142][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:40:32,800][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:40:33,461][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:40:34,119][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:40:34,779][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:40:35,437][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:40:36,096][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:40:36,755][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:40:37,414][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:40:38,072][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:40:38,732][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:40:39,392][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:40:40,051][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:40:40,711][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:40:41,371][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:40:42,030][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:40:42,689][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:40:43,348][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:40:44,007][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:40:44,667][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:40:45,326][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:40:45,985][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:40:46,645][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:40:47,304][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:40:47,964][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:40:48,623][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:40:49,283][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:40:49,943][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:40:50,602][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:40:51,260][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:40:51,919][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:40:52,578][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:40:53,237][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:40:53,896][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:40:54,555][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:40:55,213][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:40:55,872][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:40:56,532][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:40:57,191][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:40:57,849][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:40:58,508][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:40:59,166][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:40:59,824][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:41:00,809][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:41:01,469][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:41:02,128][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:41:02,785][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:41:03,443][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:41:04,101][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:41:04,759][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:41:05,419][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:41:06,077][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:41:06,735][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:41:07,392][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:41:08,050][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:41:08,713][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:41:09,371][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:41:10,030][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:41:10,688][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:41:11,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:41:12,106][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:41:14,184][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:41:14,188][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:41:14,189][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:41:17,240][__main__][INFO] - Iteration 35 took 55s (9.65% Gen, 84.82% Train). Generation: 5s, Training: 46s. Estimated remaining time: 14h 44m 46s. Estimated total time: 15h 19m 58s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 59s, 500 more iterations: 7h 39m 59s. [2026-03-25 14:41:17,242][__main__][INFO] - Starting iteration 35. [2026-03-25 14:41:17,247][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:41:17,247][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:41:22,194][__main__][INFO] - Number of regex retries in iteration 35: 0 [2026-03-25 14:41:22,195][__main__][INFO] - agents played in iteration 35 are Bob, Alice [2026-03-25 14:41:22,689][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:41:22,755][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:41:22,756][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:41:22,756][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:41:23,521][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:41:24,159][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:41:24,820][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:41:25,481][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:41:26,142][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:41:26,801][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:41:27,460][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:41:28,121][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:41:28,782][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:41:29,441][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:41:30,099][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:41:30,758][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:41:31,416][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:41:32,075][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:41:32,741][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:41:33,399][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:41:34,058][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:41:34,717][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:41:35,376][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:41:36,036][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:41:36,696][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:41:37,355][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:41:38,013][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:41:38,671][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:41:39,330][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:41:39,988][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:41:40,647][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:41:41,305][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:41:41,965][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:41:42,624][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:41:43,284][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:41:43,945][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:41:44,604][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:41:45,263][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:41:45,922][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:41:46,581][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:41:47,239][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:41:47,899][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:41:48,560][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:41:49,220][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:41:49,879][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:41:50,537][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:41:51,196][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:41:51,855][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:41:52,513][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:41:53,172][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:41:53,831][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:41:54,490][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:41:55,489][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:41:56,149][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:41:56,807][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:41:57,466][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:41:58,128][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:41:58,787][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:41:59,446][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:42:00,105][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:42:00,763][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:42:01,422][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:42:02,080][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:42:02,739][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:42:03,398][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:42:04,057][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:42:04,714][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:42:05,373][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:42:06,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:42:06,840][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:42:08,827][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:42:08,830][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:42:08,831][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:42:10,240][__main__][INFO] - Iteration 36 took 52s (9.34% Gen, 88.00% Train). Generation: 4s, Training: 46s. Estimated remaining time: 14h 7m 10s. Estimated total time: 14h 43m 15s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 19s, 500 more iterations: 7h 21m 37s. [2026-03-25 14:42:10,242][__main__][INFO] - Starting iteration 36. [2026-03-25 14:42:10,247][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:42:10,248][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:42:16,108][__main__][INFO] - Number of regex retries in iteration 36: 0 [2026-03-25 14:42:16,109][__main__][INFO] - agents played in iteration 36 are Bob, Alice [2026-03-25 14:42:16,947][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:42:17,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:42:17,012][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:42:17,013][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:42:17,863][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:42:18,490][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:42:19,151][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:42:19,812][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:42:20,472][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:42:21,133][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:42:21,793][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:42:22,452][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:42:23,111][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:42:23,771][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:42:24,430][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:42:25,089][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:42:25,748][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:42:26,407][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:42:27,066][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:42:27,724][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:42:28,385][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:42:29,043][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:42:29,702][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:42:30,360][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:42:31,019][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:42:31,679][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:42:32,338][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:42:32,997][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:42:33,655][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:42:34,313][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:42:34,972][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:42:35,633][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:42:36,293][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:42:36,952][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:42:37,612][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:42:38,272][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:42:38,932][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:42:39,591][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:42:40,250][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:42:40,909][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:42:41,567][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:42:42,227][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:42:42,887][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:42:43,546][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:42:44,205][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:42:44,864][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:42:45,522][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:42:46,181][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:42:46,841][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:42:47,500][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:42:48,159][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:42:48,819][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:42:49,807][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:42:50,466][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:42:51,124][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:42:51,782][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:42:52,440][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:42:53,099][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:42:53,759][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:42:54,416][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:42:55,074][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:42:55,733][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:42:56,393][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:42:57,052][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:42:57,711][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:42:58,369][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:42:59,028][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:42:59,687][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:43:00,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:43:01,094][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:43:02,438][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:43:02,440][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:43:02,441][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:43:03,828][__main__][INFO] - Iteration 37 took 53s (10.94% Gen, 86.47% Train). Generation: 5s, Training: 46s. Estimated remaining time: 14h 16m 4s. Estimated total time: 14h 53m 2s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 18s, 500 more iterations: 7h 26m 31s. [2026-03-25 14:43:03,830][__main__][INFO] - Starting iteration 37. [2026-03-25 14:43:03,835][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:43:03,835][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:43:09,541][__main__][INFO] - Number of regex retries in iteration 37: 0 [2026-03-25 14:43:09,541][__main__][INFO] - agents played in iteration 37 are Bob, Alice [2026-03-25 14:43:10,431][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:43:10,497][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:43:10,497][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:43:10,498][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:43:11,309][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:43:11,919][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:43:12,582][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:43:13,245][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:43:13,906][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:43:14,566][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:43:15,226][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:43:15,889][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:43:16,549][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:43:17,208][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:43:17,867][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:43:18,526][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:43:19,185][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:43:19,844][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:43:20,502][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:43:21,162][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:43:21,821][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:43:22,479][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:43:23,138][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:43:23,796][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:43:24,455][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:43:25,115][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:43:25,774][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:43:26,432][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:43:27,091][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:43:27,749][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:43:28,409][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:43:29,069][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:43:29,727][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:43:30,387][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:43:31,047][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:43:31,705][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:43:32,363][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:43:33,022][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:43:33,680][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:43:34,338][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:43:34,997][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:43:35,656][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:43:36,317][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:43:36,976][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:43:37,635][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:43:38,296][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:43:38,955][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:43:39,614][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:43:40,273][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:43:40,931][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:43:41,590][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:43:42,249][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:43:43,261][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:43:43,920][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:43:44,578][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:43:45,236][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:43:45,894][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:43:46,553][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:43:47,212][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:43:47,870][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:43:48,529][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:43:49,188][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:43:49,847][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:43:50,505][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:43:51,162][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:43:51,820][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:43:52,478][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:43:53,138][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:43:53,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:43:54,782][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:43:56,772][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:43:56,774][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:43:56,775][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:43:58,142][__main__][INFO] - Iteration 38 took 54s (10.51% Gen, 86.97% Train). Generation: 5s, Training: 47s. Estimated remaining time: 14h 27m 17s. Estimated total time: 15h 5m 9s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 30s, 500 more iterations: 7h 32m 34s. [2026-03-25 14:43:58,144][__main__][INFO] - Starting iteration 38. [2026-03-25 14:43:58,148][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:43:58,148][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:44:03,973][__main__][INFO] - Number of regex retries in iteration 38: 0 [2026-03-25 14:44:03,974][__main__][INFO] - agents played in iteration 38 are Bob, Alice [2026-03-25 14:44:05,003][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:44:05,068][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:44:05,069][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:44:05,070][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:44:05,947][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:44:06,559][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:44:07,220][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:44:07,881][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:44:08,542][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:44:09,202][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:44:09,861][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:44:10,522][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:44:11,182][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:44:11,840][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:44:12,499][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:44:13,158][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:44:13,816][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:44:14,476][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:44:15,134][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:44:15,795][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:44:16,455][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:44:17,115][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:44:17,774][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:44:18,434][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:44:19,093][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:44:19,752][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:44:20,411][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:44:21,071][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:44:21,729][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:44:22,388][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:44:23,048][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:44:23,706][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:44:24,365][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:44:25,024][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:44:25,683][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:44:26,342][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:44:27,000][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:44:27,659][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:44:28,318][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:44:28,978][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:44:29,636][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:44:30,295][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:44:30,955][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:44:31,613][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:44:32,272][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:44:32,931][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:44:33,589][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:44:34,248][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:44:34,908][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:44:35,568][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:44:36,227][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:44:36,886][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:44:37,878][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:44:38,536][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:44:39,194][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:44:39,853][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:44:40,511][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:44:41,168][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:44:41,826][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:44:42,485][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:44:43,144][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:44:43,804][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:44:44,464][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:44:45,123][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:44:45,782][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:44:46,440][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:44:47,098][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:44:47,757][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:44:48,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:44:49,144][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:44:50,547][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:44:50,550][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:44:50,551][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:44:52,001][__main__][INFO] - Iteration 39 took 53s (10.82% Gen, 86.49% Train). Generation: 5s, Training: 46s. Estimated remaining time: 14h 18m 47s. Estimated total time: 14h 57m 34s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 45s, 500 more iterations: 7h 28m 47s. [2026-03-25 14:44:52,003][__main__][INFO] - Starting iteration 39. [2026-03-25 14:44:52,007][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:44:52,008][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:44:57,099][__main__][INFO] - Number of regex retries in iteration 39: 0 [2026-03-25 14:44:57,100][__main__][INFO] - agents played in iteration 39 are Bob, Alice [2026-03-25 14:44:57,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:44:57,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:44:57,623][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:44:57,623][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:44:58,419][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:44:59,036][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:44:59,697][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:45:00,356][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:45:01,018][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:45:01,676][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:45:02,334][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:45:02,993][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:45:03,652][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:45:04,310][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:45:04,968][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:45:05,626][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:45:06,284][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:45:06,942][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:45:07,601][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:45:08,260][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:45:08,917][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:45:09,575][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:45:10,233][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:45:10,891][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:45:11,550][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:45:12,208][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:45:12,866][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:45:13,524][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:45:14,182][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:45:14,840][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:45:15,498][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:45:16,156][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:45:16,814][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:45:17,472][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:45:18,130][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:45:18,788][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:45:19,447][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:45:20,105][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:45:20,764][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:45:21,422][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:45:22,080][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:45:22,737][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:45:23,396][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:45:24,053][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:45:24,710][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:45:25,368][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:45:26,026][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:45:26,683][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:45:27,342][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:45:28,002][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:45:28,662][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:45:29,321][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:45:30,313][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:45:30,973][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:45:31,633][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:45:32,291][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:45:32,950][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:45:33,609][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:45:34,268][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:45:34,926][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:45:35,584][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:45:36,243][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:45:36,902][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:45:37,560][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:45:38,218][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:45:38,877][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:45:39,535][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:45:40,193][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:45:40,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:45:41,646][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:45:42,966][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:45:42,969][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:45:42,970][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:45:44,418][__main__][INFO] - Iteration 40 took 52s (9.72% Gen, 87.52% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 53m 53s. Estimated total time: 14h 33m 32s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 21s, 500 more iterations: 7h 16m 46s. [2026-03-25 14:45:44,420][__main__][INFO] - Starting iteration 40. [2026-03-25 14:45:44,423][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:45:44,424][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:45:49,194][__main__][INFO] - Number of regex retries in iteration 40: 0 [2026-03-25 14:45:49,196][__main__][INFO] - agents played in iteration 40 are Bob, Alice [2026-03-25 14:45:49,654][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:45:49,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:45:49,720][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:45:49,721][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:45:50,493][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:45:51,105][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:45:51,765][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:45:52,424][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:45:53,087][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:45:53,745][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:45:54,407][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:45:55,066][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:45:55,726][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:45:56,384][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:45:57,044][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:45:57,703][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:45:58,362][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:45:59,021][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:45:59,680][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:46:00,339][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:46:00,998][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:46:01,658][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:46:02,316][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:46:02,977][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:46:03,637][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:46:04,295][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:46:04,954][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:46:05,614][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:46:06,273][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:46:06,932][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:46:07,591][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:46:08,250][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:46:08,908][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:46:09,567][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:46:10,227][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:46:10,887][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:46:11,546][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:46:12,205][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:46:12,864][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:46:13,524][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:46:14,182][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:46:14,841][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:46:15,501][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:46:16,161][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:46:16,819][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:46:17,478][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:46:18,136][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:46:18,795][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:46:19,456][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:46:20,115][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:46:20,773][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:46:21,432][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:46:22,414][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:46:23,075][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:46:23,734][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:46:24,393][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:46:25,052][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:46:25,710][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:46:26,368][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:46:27,027][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:46:27,687][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:46:28,345][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:46:29,005][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:46:29,663][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:46:30,320][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:46:30,978][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:46:31,638][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:46:32,296][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:46:32,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:46:33,717][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:46:35,078][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:46:35,081][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:46:35,082][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:46:36,498][__main__][INFO] - Iteration 41 took 52s (9.16% Gen, 88.11% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 47m 25s. Estimated total time: 14h 27m 56s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 47s, 500 more iterations: 7h 13m 58s. [2026-03-25 14:46:36,500][__main__][INFO] - Starting iteration 41. [2026-03-25 14:46:36,504][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:46:36,505][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:46:41,132][__main__][INFO] - Number of regex retries in iteration 41: 0 [2026-03-25 14:46:41,133][__main__][INFO] - agents played in iteration 41 are Bob, Alice [2026-03-25 14:46:41,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:46:41,655][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:46:41,655][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:46:41,656][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:46:42,468][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:46:43,087][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:46:43,751][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:46:44,411][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:46:45,072][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:46:45,733][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:46:46,392][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:46:47,055][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:46:47,714][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:46:48,373][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:46:49,033][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:46:49,692][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:46:50,352][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:46:51,013][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:46:51,673][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:46:52,333][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:46:52,993][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:46:53,653][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:46:54,312][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:46:54,974][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:46:55,634][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:46:56,293][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:46:56,952][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:46:57,611][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:46:58,269][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:46:58,929][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:46:59,589][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:47:00,249][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:47:00,907][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:47:01,568][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:47:02,227][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:47:02,886][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:47:03,545][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:47:04,203][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:47:04,863][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:47:05,521][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:47:06,179][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:47:06,839][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:47:07,498][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:47:08,158][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:47:08,818][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:47:09,483][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:47:10,141][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:47:10,801][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:47:11,461][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:47:12,120][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:47:12,779][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:47:13,440][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:47:14,424][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:47:15,083][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:47:15,741][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:47:16,399][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:47:17,058][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:47:17,716][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:47:18,376][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:47:19,035][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:47:19,692][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:47:20,350][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:47:21,010][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:47:21,669][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:47:22,327][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:47:22,987][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:47:23,647][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:47:24,307][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:47:24,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:47:25,736][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:47:27,081][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:47:27,083][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:47:27,084][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:47:28,518][__main__][INFO] - Iteration 42 took 52s (8.90% Gen, 88.34% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 45m 33s. Estimated total time: 14h 26m 56s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 41s, 500 more iterations: 7h 13m 28s. [2026-03-25 14:47:28,520][__main__][INFO] - Starting iteration 42. [2026-03-25 14:47:28,523][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:47:28,524][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:47:33,180][__main__][INFO] - Number of regex retries in iteration 42: 0 [2026-03-25 14:47:33,181][__main__][INFO] - agents played in iteration 42 are Bob, Alice [2026-03-25 14:47:33,655][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:47:33,721][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:47:33,722][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:47:33,722][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:47:34,464][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:47:35,092][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:47:35,752][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:47:36,411][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:47:37,069][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:47:37,727][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:47:38,389][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:47:39,046][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:47:39,706][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:47:40,367][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:47:41,029][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:47:41,690][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:47:42,349][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:47:43,009][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:47:43,671][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:47:44,331][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:47:44,990][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:47:45,650][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:47:46,309][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:47:46,969][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:47:47,629][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:47:48,290][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:47:48,949][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:47:49,609][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:47:50,268][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:47:50,928][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:47:51,588][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:47:52,247][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:47:52,907][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:47:53,567][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:47:54,226][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:47:54,885][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:47:55,545][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:47:56,205][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:47:56,864][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:47:57,522][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:47:58,181][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:47:58,841][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:47:59,502][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:48:00,161][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:48:00,821][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:48:01,480][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:48:02,140][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:48:02,801][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:48:03,461][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:48:04,120][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:48:04,780][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:48:05,440][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:48:06,460][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:48:07,122][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:48:07,782][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:48:08,441][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:48:09,100][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:48:09,759][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:48:10,419][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:48:11,079][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:48:11,739][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:48:12,397][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:48:13,057][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:48:13,718][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:48:14,376][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:48:15,035][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:48:15,694][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:48:16,353][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:48:17,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:48:17,795][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:48:19,593][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:48:19,596][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:48:19,597][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:48:25,166][__main__][INFO] - Iteration 43 took 56s (8.22% Gen, 81.94% Train). Generation: 4s, Training: 46s. Estimated remaining time: 15h 1m 44s. Estimated total time: 15h 44m 3s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 24s, 500 more iterations: 7h 52m 1s. [2026-03-25 14:48:25,168][__main__][INFO] - Starting iteration 43. [2026-03-25 14:48:25,173][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:48:25,173][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:48:38,376][__main__][INFO] - Number of regex retries in iteration 43: 0 [2026-03-25 14:48:38,377][__main__][INFO] - agents played in iteration 43 are Bob, Alice [2026-03-25 14:48:39,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:48:39,480][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:48:39,481][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:48:39,481][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:48:40,257][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:48:40,895][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:48:41,525][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:48:42,184][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:48:42,842][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:48:43,502][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:48:44,162][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:48:44,821][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:48:45,479][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:48:46,138][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:48:46,797][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:48:47,456][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:48:48,116][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:48:48,774][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:48:49,434][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:48:50,095][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:48:50,754][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:48:51,414][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:48:52,073][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:48:52,733][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:48:53,393][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:48:54,050][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:48:54,708][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:48:55,366][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:48:56,026][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:48:56,684][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:48:57,342][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:48:58,002][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:48:58,664][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:48:59,322][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:48:59,982][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:49:00,640][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:49:01,299][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:49:01,957][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:49:02,615][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:49:03,273][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:49:03,932][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:49:04,591][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:49:05,250][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:49:05,909][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:49:06,569][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:49:07,227][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:49:07,885][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:49:08,544][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:49:09,203][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:49:09,862][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:49:10,521][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:49:11,179][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:49:12,168][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:49:12,828][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:49:13,487][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:49:14,147][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:49:14,806][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:49:15,464][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:49:16,125][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:49:16,784][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:49:17,444][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:49:18,102][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:49:18,760][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:49:19,418][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:49:20,076][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:49:20,734][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:49:21,393][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:49:22,053][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:49:22,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:49:23,517][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:49:24,921][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:49:24,924][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:49:24,926][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:49:26,263][__main__][INFO] - Iteration 44 took 1m 1s (21.61% Gen, 76.19% Train). Generation: 13s, Training: 46s. Estimated remaining time: 16h 14m 52s. Estimated total time: 16h 58m 12s. Time estimates for 10 more iterations: 10m 10s, 100 more iterations: 1h 41m 49s, 500 more iterations: 8h 29m 6s. [2026-03-25 14:49:26,265][__main__][INFO] - Starting iteration 44. [2026-03-25 14:49:26,268][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:49:26,269][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:49:31,079][__main__][INFO] - Number of regex retries in iteration 44: 0 [2026-03-25 14:49:31,079][__main__][INFO] - agents played in iteration 44 are Bob, Alice [2026-03-25 14:49:31,554][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:49:31,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:49:31,619][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:49:31,620][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:49:32,269][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:49:32,887][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:49:33,549][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:49:34,208][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:49:34,868][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:49:35,527][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:49:36,187][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:49:36,846][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:49:37,506][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:49:38,167][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:49:38,826][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:49:39,486][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:49:40,145][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:49:40,806][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:49:41,466][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:49:42,129][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:49:42,790][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:49:43,452][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:49:44,113][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:49:44,772][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:49:45,434][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:49:46,095][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:49:46,755][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:49:47,414][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:49:48,075][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:49:48,735][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:49:49,394][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:49:50,054][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:49:50,713][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:49:51,373][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:49:52,032][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:49:52,691][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:49:53,353][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:49:54,012][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:49:54,671][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:49:55,329][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:49:55,989][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:49:56,648][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:49:57,307][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:49:57,966][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:49:58,627][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:49:59,286][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:49:59,947][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:50:00,606][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:50:01,266][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:50:01,926][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:50:02,585][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:50:03,243][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:50:04,222][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:50:04,883][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:50:05,542][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:50:06,202][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:50:06,861][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:50:07,520][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:50:08,180][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:50:08,839][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:50:09,499][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:50:10,159][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:50:10,819][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:50:11,477][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:50:12,135][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:50:12,794][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:50:13,453][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:50:14,112][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:50:14,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:50:15,537][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:50:16,886][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:50:16,888][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:50:16,890][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:50:18,426][__main__][INFO] - Iteration 45 took 52s (9.22% Gen, 87.83% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 45m 5s. Estimated total time: 14h 29m 18s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 55s, 500 more iterations: 7h 14m 39s. [2026-03-25 14:50:18,428][__main__][INFO] - Starting iteration 45. [2026-03-25 14:50:18,432][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:50:18,432][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:50:23,555][__main__][INFO] - Number of regex retries in iteration 45: 0 [2026-03-25 14:50:23,556][__main__][INFO] - agents played in iteration 45 are Bob, Alice [2026-03-25 14:50:24,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:50:24,090][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:50:24,091][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:50:24,091][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:50:24,936][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:50:25,539][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:50:26,199][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:50:26,857][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:50:27,516][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:50:28,174][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:50:28,833][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:50:29,492][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:50:30,151][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:50:30,810][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:50:31,468][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:50:32,128][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:50:32,788][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:50:33,447][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:50:34,106][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:50:34,766][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:50:35,426][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:50:36,086][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:50:36,747][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:50:37,405][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:50:38,067][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:50:38,726][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:50:39,386][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:50:40,044][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:50:40,705][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:50:41,363][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:50:42,021][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:50:42,680][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:50:43,339][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:50:43,998][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:50:44,657][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:50:45,316][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:50:45,975][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:50:46,634][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:50:47,294][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:50:47,953][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:50:48,612][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:50:49,272][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:50:49,931][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:50:50,590][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:50:51,248][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:50:51,909][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:50:52,568][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:50:53,227][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:50:53,885][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:50:54,543][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:50:55,201][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:50:55,859][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:50:56,902][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:50:57,562][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:50:58,221][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:50:58,879][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:50:59,537][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:51:00,195][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:51:00,855][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:51:01,513][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:51:02,172][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:51:02,830][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:51:03,489][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:51:04,149][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:51:04,809][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:51:05,467][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:51:06,125][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:51:06,783][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:51:07,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:51:08,430][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:51:09,819][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:51:09,821][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:51:09,822][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:51:11,257][__main__][INFO] - Iteration 46 took 52s (9.70% Gen, 87.63% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 55m 21s. Estimated total time: 14h 40m 27s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 2s, 500 more iterations: 7h 20m 13s. [2026-03-25 14:51:11,259][__main__][INFO] - Starting iteration 46. [2026-03-25 14:51:11,263][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:51:11,263][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:51:16,161][__main__][INFO] - Number of regex retries in iteration 46: 0 [2026-03-25 14:51:16,162][__main__][INFO] - agents played in iteration 46 are Bob, Alice [2026-03-25 14:51:16,654][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:51:16,722][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:51:16,722][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:51:16,723][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:51:17,518][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:51:18,136][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:51:18,796][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:51:19,455][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:51:20,114][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:51:20,773][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:51:21,431][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:51:22,090][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:51:22,749][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:51:23,408][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:51:24,068][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:51:24,730][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:51:25,392][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:51:26,053][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:51:26,716][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:51:27,378][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:51:28,038][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:51:28,699][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:51:29,360][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:51:30,022][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:51:30,682][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:51:31,343][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:51:32,002][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:51:32,662][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:51:33,322][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:51:33,983][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:51:34,643][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:51:35,303][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:51:35,961][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:51:36,620][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:51:37,280][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:51:37,939][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:51:38,598][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:51:39,260][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:51:39,922][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:51:40,581][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:51:41,241][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:51:41,901][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:51:42,559][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:51:43,219][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:51:43,878][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:51:44,538][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:51:45,196][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:51:45,856][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:51:46,515][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:51:47,174][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:51:47,833][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:51:48,492][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:51:49,490][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:51:50,150][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:51:50,809][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:51:51,470][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:51:52,132][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:51:52,793][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:51:53,452][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:51:54,111][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:51:54,770][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:51:55,428][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:51:56,089][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:51:56,746][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:51:57,405][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:51:58,063][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:51:58,724][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:51:59,383][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:52:00,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:52:00,898][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:52:02,256][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:52:02,258][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:52:02,260][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:52:03,651][__main__][INFO] - Iteration 47 took 52s (9.35% Gen, 87.99% Train). Generation: 4s, Training: 46s. Estimated remaining time: 13h 47m 11s. Estimated total time: 14h 33m 9s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 18s, 500 more iterations: 7h 16m 34s. [2026-03-25 14:52:03,653][__main__][INFO] - Starting iteration 47. [2026-03-25 14:52:03,657][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:52:03,658][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:52:08,576][__main__][INFO] - Number of regex retries in iteration 47: 0 [2026-03-25 14:52:08,577][__main__][INFO] - agents played in iteration 47 are Bob, Alice [2026-03-25 14:52:09,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:52:09,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:52:09,183][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:52:09,184][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:52:09,985][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:52:10,602][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:52:11,263][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:52:11,923][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:52:12,585][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:52:13,246][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:52:13,905][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:52:14,565][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:52:15,225][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:52:15,883][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:52:16,547][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:52:17,207][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:52:17,867][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:52:18,526][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:52:19,186][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:52:19,846][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:52:20,508][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:52:21,167][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:52:21,827][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:52:22,486][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:52:23,145][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:52:23,804][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:52:24,463][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:52:25,122][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:52:25,782][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:52:26,442][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:52:27,101][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:52:27,760][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:52:28,420][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:52:29,080][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:52:29,739][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:52:30,398][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:52:31,057][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:52:31,715][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:52:32,375][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:52:33,035][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:52:33,693][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:52:34,353][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:52:35,012][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:52:35,672][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:52:36,331][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:52:36,991][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:52:37,651][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:52:38,310][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:52:38,969][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:52:39,628][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:52:40,287][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:52:40,946][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:52:41,933][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:52:42,593][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:52:43,252][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:52:43,912][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:52:44,570][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:52:45,228][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:52:45,887][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:52:46,547][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:52:47,205][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:52:47,863][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:52:48,523][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:52:49,183][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:52:49,841][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:52:50,499][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:52:51,158][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:52:51,816][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:52:52,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:52:53,298][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:52:54,647][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:52:54,649][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:52:54,651][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:52:56,045][__main__][INFO] - Iteration 48 took 52s (9.39% Gen, 87.94% Train). Generation: 4s, Training: 46s. Estimated remaining time: 13h 46m 19s. Estimated total time: 14h 33m 10s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 19s, 500 more iterations: 7h 16m 35s. [2026-03-25 14:52:56,047][__main__][INFO] - Starting iteration 48. [2026-03-25 14:52:56,052][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:52:56,052][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:53:01,095][__main__][INFO] - Number of regex retries in iteration 48: 0 [2026-03-25 14:53:01,096][__main__][INFO] - agents played in iteration 48 are Bob, Alice [2026-03-25 14:53:01,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:53:02,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:53:02,031][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:53:02,032][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:53:02,790][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:53:03,400][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:53:04,060][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:53:04,720][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:53:05,380][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:53:06,040][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:53:06,699][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:53:07,358][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:53:08,018][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:53:08,679][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:53:09,339][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:53:09,999][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:53:10,659][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:53:11,319][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:53:11,979][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:53:12,640][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:53:13,302][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:53:13,965][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:53:14,625][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:53:15,284][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:53:15,943][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:53:16,602][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:53:17,260][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:53:17,920][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:53:18,580][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:53:19,240][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:53:19,900][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:53:20,558][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:53:21,217][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:53:21,876][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:53:22,536][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:53:23,194][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:53:23,853][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:53:24,512][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:53:25,171][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:53:25,830][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:53:26,490][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:53:27,149][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:53:27,809][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:53:28,470][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:53:29,130][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:53:29,789][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:53:30,448][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:53:31,107][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:53:31,767][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:53:32,425][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:53:33,084][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:53:33,742][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:53:34,723][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:53:35,385][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:53:36,045][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:53:36,704][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:53:37,363][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:53:38,021][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:53:38,680][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:53:39,339][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:53:39,997][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:53:40,656][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:53:41,315][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:53:41,975][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:53:42,634][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:53:43,295][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:53:43,954][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:53:44,612][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:53:45,271][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:53:46,230][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:53:47,580][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:53:47,583][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:53:47,584][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:53:49,016][__main__][INFO] - Iteration 49 took 52s (9.52% Gen, 87.77% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 55m 3s. Estimated total time: 14h 42m 46s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 16s, 500 more iterations: 7h 21m 23s. [2026-03-25 14:53:49,020][__main__][INFO] - Starting iteration 49. [2026-03-25 14:53:49,024][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:53:49,024][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:53:54,644][__main__][INFO] - Number of regex retries in iteration 49: 0 [2026-03-25 14:53:54,645][__main__][INFO] - agents played in iteration 49 are Bob, Alice [2026-03-25 14:53:55,446][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:53:55,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:53:55,511][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:53:55,512][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:53:56,274][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:53:56,891][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:53:57,551][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:53:58,212][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:53:58,871][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:53:59,536][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:54:00,196][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:54:00,857][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:54:01,518][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:54:02,180][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:54:02,838][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:54:03,498][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:54:04,157][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:54:04,816][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:54:05,475][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:54:06,134][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:54:06,793][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:54:07,453][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:54:08,112][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:54:08,770][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:54:09,430][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:54:10,088][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:54:10,747][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:54:11,407][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:54:12,066][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:54:12,727][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:54:13,383][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:54:14,042][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:54:14,702][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:54:15,361][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:54:16,020][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:54:16,679][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:54:17,338][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:54:17,998][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:54:18,656][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:54:19,316][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:54:19,975][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:54:20,634][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:54:21,296][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:54:21,955][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:54:22,613][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:54:23,274][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:54:23,934][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:54:24,594][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:54:25,253][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:54:25,913][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:54:26,571][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:54:27,230][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:54:28,209][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:54:28,868][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:54:29,526][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:54:30,185][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:54:30,844][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:54:31,503][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:54:32,162][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:54:32,821][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:54:33,479][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:54:34,139][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:54:34,798][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:54:35,457][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:54:36,115][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:54:36,773][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:54:37,432][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:54:38,090][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:54:38,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:54:39,613][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:54:41,100][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:54:41,104][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:54:41,105][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:54:42,758][__main__][INFO] - Iteration 50 took 53s (10.46% Gen, 86.46% Train). Generation: 5s, Training: 46s. Estimated remaining time: 14h 6m 58s. Estimated total time: 14h 55m 35s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 33s, 500 more iterations: 7h 27m 47s. [2026-03-25 14:54:42,760][__main__][INFO] - Starting iteration 50. [2026-03-25 14:54:42,765][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:54:42,765][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:54:47,671][__main__][INFO] - Number of regex retries in iteration 50: 0 [2026-03-25 14:54:47,673][__main__][INFO] - agents played in iteration 50 are Bob, Alice [2026-03-25 14:54:48,127][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:54:48,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:54:48,193][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:54:48,194][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:54:48,895][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:54:49,533][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:54:50,177][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:54:50,835][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:54:51,496][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:54:52,157][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:54:52,817][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:54:53,478][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:54:54,138][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:54:54,798][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:54:55,458][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:54:56,117][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:54:56,776][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:54:57,435][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:54:58,096][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:54:58,755][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:54:59,415][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:55:00,074][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:55:00,734][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:55:01,394][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:55:02,052][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:55:02,713][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:55:03,372][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:55:04,031][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:55:04,689][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:55:05,348][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:55:06,008][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:55:06,668][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:55:07,327][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:55:07,986][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:55:08,645][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:55:09,304][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:55:09,963][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:55:10,622][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:55:11,283][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:55:11,941][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:55:12,599][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:55:13,258][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:55:13,917][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:55:14,576][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:55:15,235][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:55:15,895][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:55:16,555][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:55:17,215][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:55:17,874][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:55:18,533][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:55:19,195][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:55:19,855][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:55:20,832][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:55:21,492][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:55:22,152][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:55:22,811][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:55:23,470][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:55:24,129][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:55:24,791][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:55:25,448][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:55:26,107][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:55:26,765][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:55:27,425][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:55:28,082][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:55:28,745][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:55:29,404][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:55:30,062][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:55:30,721][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:55:31,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:55:32,270][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:55:33,629][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:55:33,632][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:55:33,633][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:55:36,329][__main__][INFO] - Iteration 51 took 53s (9.16% Gen, 85.80% Train). Generation: 4s, Training: 45s. Estimated remaining time: 14h 3m 15s. Estimated total time: 14h 52m 46s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 16s, 500 more iterations: 7h 26m 23s. [2026-03-25 14:55:36,332][__main__][INFO] - Starting iteration 51. [2026-03-25 14:55:36,337][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 14:55:36,337][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:55:41,080][__main__][INFO] - Number of regex retries in iteration 51: 0 [2026-03-25 14:55:41,081][__main__][INFO] - agents played in iteration 51 are Bob, Alice [2026-03-25 14:55:41,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:55:41,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:55:41,705][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:55:41,705][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:55:42,352][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:55:42,964][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:55:43,624][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:55:44,282][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:55:44,941][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:55:45,600][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:55:46,259][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:55:46,918][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:55:47,581][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:55:48,240][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:55:48,901][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:55:49,561][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:55:50,221][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:55:50,880][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:55:51,538][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:55:52,197][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:55:52,857][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:55:53,516][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:55:54,175][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:55:54,839][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:55:55,499][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:55:56,159][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:55:56,819][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:55:57,481][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:55:58,141][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:55:58,802][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:55:59,463][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:56:00,121][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:56:00,783][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:56:01,442][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:56:02,102][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:56:02,762][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:56:03,421][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:56:04,080][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:56:04,740][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:56:05,400][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:56:06,061][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:56:06,720][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:56:07,380][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:56:08,039][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:56:08,697][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:56:09,357][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:56:10,017][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:56:10,677][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:56:11,337][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:56:11,996][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:56:12,655][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:56:13,314][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:56:14,294][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:56:14,955][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:56:15,614][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:56:16,272][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:56:16,931][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:56:17,588][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:56:18,248][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:56:18,908][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:56:19,565][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:56:20,224][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:56:20,885][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:56:21,544][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:56:22,206][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:56:22,865][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:56:23,525][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:56:24,184][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:56:24,842][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:56:25,742][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:56:27,113][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:56:27,115][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:56:27,117][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:56:28,546][__main__][INFO] - Iteration 52 took 52s (9.08% Gen, 88.17% Train). Generation: 4s, Training: 46s. Estimated remaining time: 13h 39m 48s. Estimated total time: 14h 30m 11s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 1s, 500 more iterations: 7h 15m 5s. [2026-03-25 14:56:28,548][__main__][INFO] - Starting iteration 52. [2026-03-25 14:56:28,551][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 14:56:28,552][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:56:33,309][__main__][INFO] - Number of regex retries in iteration 52: 0 [2026-03-25 14:56:33,310][__main__][INFO] - agents played in iteration 52 are Bob, Alice [2026-03-25 14:56:33,901][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:56:33,965][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:56:33,965][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:56:33,966][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:56:34,842][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:56:35,452][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:56:36,116][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:56:36,777][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:56:37,437][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:56:38,096][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:56:38,757][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:56:39,419][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:56:40,081][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:56:40,740][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:56:41,401][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:56:42,059][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:56:42,719][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:56:43,379][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:56:44,038][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:56:44,698][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:56:45,357][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:56:46,016][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:56:46,677][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:56:47,337][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:56:47,996][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:56:48,655][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:56:49,313][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:56:49,973][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:56:50,633][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:56:51,292][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:56:51,951][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:56:52,611][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:56:53,270][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:56:53,929][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:56:54,590][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:56:55,249][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:56:55,909][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:56:56,567][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:56:57,227][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:56:57,888][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:56:58,547][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:56:59,207][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:56:59,866][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:57:00,525][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:57:01,186][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:57:01,844][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:57:02,504][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:57:03,163][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:57:03,824][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:57:04,482][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:57:05,142][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:57:05,801][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:57:06,814][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:57:07,472][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:57:08,130][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:57:08,788][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:57:09,447][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:57:10,108][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:57:10,770][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:57:11,428][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:57:12,086][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:57:12,746][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:57:13,404][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:57:14,062][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:57:14,722][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:57:15,381][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:57:16,040][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:57:16,699][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:57:17,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:57:18,125][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:57:19,748][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:57:19,751][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:57:19,752][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:57:21,077][__main__][INFO] - Iteration 53 took 52s (9.06% Gen, 88.41% Train). Generation: 4s, Training: 46s. Estimated remaining time: 13h 44m 12s. Estimated total time: 14h 35m 27s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 32s, 500 more iterations: 7h 17m 43s. [2026-03-25 14:57:21,080][__main__][INFO] - Starting iteration 53. [2026-03-25 14:57:21,083][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 14:57:21,084][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:57:26,120][__main__][INFO] - Number of regex retries in iteration 53: 0 [2026-03-25 14:57:26,120][__main__][INFO] - agents played in iteration 53 are Bob, Alice [2026-03-25 14:57:26,611][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:57:26,677][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:57:26,678][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:57:26,678][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:57:27,374][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:57:28,007][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:57:28,666][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:57:29,324][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:57:29,985][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:57:30,644][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:57:31,303][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:57:31,963][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:57:32,621][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:57:33,283][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:57:33,943][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:57:34,602][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:57:35,261][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:57:35,919][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:57:36,579][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:57:37,239][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:57:37,896][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:57:38,557][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:57:39,217][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:57:39,875][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:57:40,533][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:57:41,192][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:57:41,853][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:57:42,510][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:57:43,171][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:57:43,830][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:57:44,489][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:57:45,148][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:57:45,807][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:57:46,467][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:57:47,126][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:57:47,784][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:57:48,443][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:57:49,101][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:57:49,761][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:57:50,424][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:57:51,081][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:57:51,739][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:57:52,398][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:57:53,057][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:57:53,718][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:57:54,377][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:57:55,035][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:57:55,693][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:57:56,350][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:57:57,008][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:57:57,667][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:57:58,325][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:57:59,331][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:57:59,990][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:58:00,648][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:58:01,309][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:58:01,969][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:58:02,628][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:58:03,287][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:58:03,946][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:58:04,604][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:58:05,264][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:58:05,922][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:58:06,581][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:58:07,240][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:58:07,899][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:58:08,559][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:58:09,218][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:58:09,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:58:10,653][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:58:11,997][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:58:12,000][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:58:12,005][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:58:13,585][__main__][INFO] - Iteration 54 took 52s (9.59% Gen, 87.39% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 42m 54s. Estimated total time: 14h 35m 3s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 30s, 500 more iterations: 7h 17m 31s. [2026-03-25 14:58:13,587][__main__][INFO] - Starting iteration 54. [2026-03-25 14:58:13,592][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 14:58:13,593][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:58:18,760][__main__][INFO] - Number of regex retries in iteration 54: 0 [2026-03-25 14:58:18,760][__main__][INFO] - agents played in iteration 54 are Bob, Alice [2026-03-25 14:58:19,846][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:58:19,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:58:19,913][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:58:19,913][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:58:20,686][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:58:21,351][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:58:21,962][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:58:22,621][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:58:23,282][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:58:23,941][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:58:24,602][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:58:25,262][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:58:25,921][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:58:26,582][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:58:27,241][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:58:27,902][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:58:28,562][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:58:29,223][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:58:29,883][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:58:30,543][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:58:31,203][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:58:31,863][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:58:32,522][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:58:33,183][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:58:33,843][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:58:34,502][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:58:35,162][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:58:35,822][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:58:36,482][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:58:37,143][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:58:37,803][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:58:38,463][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:58:39,123][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:58:39,785][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:58:40,445][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:58:41,104][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:58:41,763][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:58:42,422][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:58:43,082][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:58:43,740][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:58:44,400][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:58:45,060][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:58:45,720][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:58:46,382][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:58:47,042][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:58:47,700][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:58:48,358][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:58:49,017][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:58:49,676][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:58:50,335][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:58:50,993][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:58:51,652][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:58:52,637][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:58:53,299][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:58:53,956][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:58:54,615][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:58:55,274][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:58:55,934][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:58:56,593][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:58:57,251][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:58:57,909][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:58:58,568][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:58:59,226][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:58:59,884][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:59:00,542][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:59:01,200][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:59:01,859][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:59:02,518][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:59:03,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:59:04,017][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 14:59:05,400][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:59:05,403][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:59:05,405][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:59:06,848][__main__][INFO] - Iteration 55 took 53s (9.70% Gen, 87.58% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 54m 36s. Estimated total time: 14h 47m 38s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 45s, 500 more iterations: 7h 23m 49s. [2026-03-25 14:59:06,851][__main__][INFO] - Starting iteration 55. [2026-03-25 14:59:06,856][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 14:59:06,856][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:59:15,722][__main__][INFO] - Number of regex retries in iteration 55: 0 [2026-03-25 14:59:15,724][__main__][INFO] - agents played in iteration 55 are Bob, Alice [2026-03-25 14:59:16,748][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:59:16,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 14:59:16,814][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:59:16,814][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:59:17,633][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:59:18,259][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:59:18,918][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:59:19,577][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:59:20,236][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:59:20,895][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:59:21,553][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:59:22,214][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:59:22,873][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:59:23,532][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:59:24,192][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:59:24,850][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:59:25,511][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:59:26,172][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:59:26,833][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:59:27,495][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:59:28,153][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:59:28,813][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:59:29,472][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:59:30,131][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:59:30,789][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:59:31,448][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:59:32,107][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:59:32,766][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:59:33,425][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:59:34,084][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:59:34,743][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:59:35,401][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:59:36,061][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:59:36,719][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:59:37,377][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:59:38,034][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:59:38,693][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:59:39,351][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:59:40,010][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:59:40,667][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:59:41,325][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:59:41,984][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:59:42,643][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:59:43,301][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:59:43,959][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:59:44,618][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:59:45,275][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:59:45,933][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:59:46,591][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:59:47,250][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:59:47,908][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:59:48,566][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:59:49,548][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:59:50,210][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:59:50,874][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:59:51,533][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:59:52,191][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:59:52,851][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:59:53,510][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:59:54,169][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:59:54,829][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:59:55,489][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:59:56,147][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:59:56,805][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:59:57,464][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:59:58,123][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:59:58,782][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:59:59,440][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:00:00,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:00:00,894][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:00:02,378][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:00:02,381][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:00:02,382][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:00:04,736][__main__][INFO] - Iteration 56 took 57s (15.32% Gen, 80.61% Train). Generation: 8s, Training: 46s. Estimated remaining time: 15h 10m 42s. Estimated total time: 16h 4m 42s. Time estimates for 10 more iterations: 9m 38s, 100 more iterations: 1h 36m 28s, 500 more iterations: 8h 2m 21s. [2026-03-25 15:00:04,738][__main__][INFO] - Starting iteration 56. [2026-03-25 15:00:04,743][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:00:04,743][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:00:09,561][__main__][INFO] - Number of regex retries in iteration 56: 0 [2026-03-25 15:00:09,562][__main__][INFO] - agents played in iteration 56 are Bob, Alice [2026-03-25 15:00:10,134][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:00:10,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:00:10,201][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:00:10,201][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:00:10,856][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:00:11,475][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:00:12,136][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:00:12,795][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:00:13,455][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:00:14,114][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:00:14,775][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:00:15,434][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:00:16,094][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:00:16,756][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:00:17,415][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:00:18,075][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:00:18,736][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:00:19,396][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:00:20,055][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:00:20,716][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:00:21,376][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:00:22,036][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:00:22,695][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:00:23,357][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:00:24,018][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:00:24,679][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:00:25,339][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:00:25,999][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:00:26,663][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:00:27,318][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:00:27,978][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:00:28,639][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:00:29,299][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:00:29,959][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:00:30,620][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:00:31,281][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:00:31,942][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:00:32,605][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:00:33,265][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:00:33,925][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:00:34,586][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:00:35,248][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:00:35,907][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:00:36,567][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:00:37,227][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:00:37,888][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:00:38,548][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:00:39,210][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:00:39,871][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:00:40,530][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:00:41,189][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:00:41,849][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:00:42,826][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:00:43,488][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:00:44,148][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:00:44,806][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:00:45,467][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:00:46,126][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:00:46,786][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:00:47,447][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:00:48,107][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:00:48,766][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:00:49,426][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:00:50,085][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:00:50,743][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:00:51,401][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:00:52,059][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:00:52,717][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:00:53,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:00:54,143][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:00:55,504][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:00:55,506][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:00:55,507][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:00:56,982][__main__][INFO] - Iteration 57 took 52s (9.22% Gen, 87.95% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 35m 50s. Estimated total time: 14h 30m 42s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 4s, 500 more iterations: 7h 15m 21s. [2026-03-25 15:00:56,985][__main__][INFO] - Starting iteration 57. [2026-03-25 15:00:56,988][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:00:56,989][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:01:02,005][__main__][INFO] - Number of regex retries in iteration 57: 0 [2026-03-25 15:01:02,007][__main__][INFO] - agents played in iteration 57 are Bob, Alice [2026-03-25 15:01:02,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:01:02,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:01:02,648][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:01:02,649][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:01:03,470][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:01:04,109][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:01:04,774][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:01:05,434][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:01:06,095][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:01:06,754][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:01:07,414][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:01:08,078][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:01:08,739][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:01:09,400][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:01:10,059][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:01:10,718][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:01:11,380][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:01:12,043][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:01:12,702][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:01:13,363][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:01:14,022][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:01:14,682][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:01:15,341][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:01:16,001][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:01:16,660][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:01:17,320][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:01:17,980][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:01:18,639][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:01:19,300][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:01:19,959][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:01:20,618][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:01:21,279][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:01:21,938][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:01:22,599][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:01:23,259][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:01:23,920][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:01:24,589][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:01:25,255][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:01:25,913][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:01:26,575][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:01:27,237][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:01:27,895][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:01:28,555][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:01:29,217][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:01:29,878][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:01:30,540][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:01:31,197][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:01:31,857][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:01:32,516][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:01:33,176][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:01:33,835][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:01:34,494][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:01:35,520][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:01:36,179][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:01:36,842][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:01:37,503][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:01:38,162][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:01:38,820][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:01:39,479][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:01:40,140][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:01:40,799][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:01:41,458][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:01:42,116][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:01:42,776][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:01:43,435][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:01:44,094][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:01:44,754][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:01:45,413][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:01:46,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:01:46,838][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:01:48,211][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:01:48,214][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:01:48,215][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:01:49,641][__main__][INFO] - Iteration 58 took 52s (9.53% Gen, 87.76% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 41m 50s. Estimated total time: 14h 37m 34s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 45s, 500 more iterations: 7h 18m 47s. [2026-03-25 15:01:49,643][__main__][INFO] - Starting iteration 58. [2026-03-25 15:01:49,647][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:01:49,648][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:02:00,258][__main__][INFO] - Number of regex retries in iteration 58: 0 [2026-03-25 15:02:00,260][__main__][INFO] - agents played in iteration 58 are Bob, Alice [2026-03-25 15:02:01,273][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:02:01,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:02:01,347][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:02:01,348][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:02:02,016][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:02:02,646][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:02:03,306][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:02:03,967][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:02:04,629][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:02:05,289][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:02:05,950][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:02:06,611][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:02:07,271][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:02:07,930][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:02:08,591][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:02:09,250][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:02:09,911][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:02:10,571][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:02:11,231][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:02:11,891][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:02:12,550][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:02:13,209][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:02:13,867][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:02:14,526][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:02:15,186][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:02:15,846][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:02:16,505][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:02:17,164][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:02:17,825][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:02:18,485][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:02:19,145][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:02:19,804][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:02:20,463][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:02:21,124][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:02:21,826][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:02:22,485][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:02:23,145][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:02:23,805][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:02:24,464][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:02:25,123][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:02:25,784][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:02:26,443][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:02:27,102][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:02:27,761][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:02:28,420][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:02:29,078][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:02:29,738][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:02:30,396][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:02:31,057][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:02:31,716][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:02:32,375][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:02:33,033][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:02:34,016][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:02:34,679][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:02:35,339][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:02:36,002][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:02:36,660][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:02:37,319][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:02:37,978][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:02:38,636][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:02:39,296][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:02:39,956][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:02:40,615][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:02:41,274][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:02:41,937][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:02:42,597][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:02:43,256][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:02:43,914][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:02:44,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:02:45,346][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:02:46,826][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:02:46,829][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:02:46,830][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:02:48,259][__main__][INFO] - Iteration 59 took 58s (18.10% Gen, 79.45% Train). Generation: 10s, Training: 46s. Estimated remaining time: 15h 20m 11s. Estimated total time: 16h 16m 54s. Time estimates for 10 more iterations: 9m 46s, 100 more iterations: 1h 37m 41s, 500 more iterations: 8h 8m 27s. [2026-03-25 15:02:48,262][__main__][INFO] - Starting iteration 59. [2026-03-25 15:02:48,266][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:02:48,266][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:02:53,997][__main__][INFO] - Number of regex retries in iteration 59: 0 [2026-03-25 15:02:53,998][__main__][INFO] - agents played in iteration 59 are Bob, Alice [2026-03-25 15:02:54,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:02:54,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:02:54,522][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:02:54,523][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:02:55,343][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:02:55,961][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:02:56,620][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:02:57,280][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:02:57,940][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:02:58,604][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:02:59,264][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:02:59,926][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:03:00,586][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:03:01,249][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:03:01,908][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:03:02,568][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:03:03,229][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:03:03,891][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:03:04,550][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:03:05,209][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:03:05,868][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:03:06,532][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:03:07,192][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:03:07,855][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:03:08,517][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:03:09,179][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:03:09,844][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:03:10,505][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:03:11,164][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:03:11,823][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:03:12,483][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:03:13,142][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:03:13,800][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:03:14,460][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:03:15,119][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:03:15,779][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:03:16,438][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:03:17,099][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:03:17,759][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:03:18,419][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:03:19,078][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:03:19,737][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:03:20,396][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:03:21,056][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:03:21,717][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:03:22,378][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:03:23,039][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:03:23,699][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:03:24,362][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:03:25,021][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:03:25,681][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:03:26,341][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:03:27,334][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:03:27,996][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:03:28,657][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:03:29,318][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:03:29,975][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:03:30,634][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:03:31,293][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:03:31,953][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:03:32,614][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:03:33,272][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:03:33,931][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:03:34,589][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:03:35,248][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:03:35,907][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:03:36,567][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:03:37,226][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:03:37,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:03:38,666][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:03:40,180][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:03:40,183][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:03:40,184][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:03:41,714][__main__][INFO] - Iteration 60 took 53s (10.72% Gen, 86.41% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 53m 12s. Estimated total time: 14h 50m 49s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 4s, 500 more iterations: 7h 25m 24s. [2026-03-25 15:03:41,716][__main__][INFO] - Starting iteration 60. [2026-03-25 15:03:41,720][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:03:41,720][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:03:46,677][__main__][INFO] - Number of regex retries in iteration 60: 0 [2026-03-25 15:03:46,678][__main__][INFO] - agents played in iteration 60 are Bob, Alice [2026-03-25 15:03:47,248][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:03:47,315][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:03:47,315][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:03:47,316][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:03:47,994][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:03:48,615][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:03:49,276][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:03:49,936][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:03:50,596][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:03:51,257][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:03:51,918][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:03:52,577][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:03:53,238][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:03:53,897][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:03:54,556][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:03:55,216][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:03:55,876][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:03:56,537][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:03:57,197][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:03:57,856][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:03:58,516][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:03:59,175][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:03:59,838][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:04:00,499][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:04:01,160][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:04:01,820][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:04:02,481][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:04:03,144][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:04:03,804][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:04:04,463][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:04:05,122][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:04:05,781][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:04:06,441][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:04:07,100][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:04:07,760][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:04:08,420][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:04:09,080][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:04:09,740][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:04:10,400][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:04:11,062][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:04:11,722][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:04:12,381][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:04:13,040][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:04:13,700][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:04:14,360][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:04:15,019][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:04:15,678][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:04:16,337][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:04:16,996][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:04:17,655][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:04:18,314][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:04:18,973][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:04:19,949][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:04:20,608][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:04:21,267][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:04:21,926][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:04:22,585][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:04:23,243][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:04:23,903][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:04:24,562][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:04:25,221][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:04:25,879][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:04:26,537][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:04:27,195][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:04:27,854][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:04:28,512][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:04:29,171][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:04:29,829][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:04:30,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:04:31,429][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:04:32,797][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:04:32,800][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:04:32,801][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:04:34,195][__main__][INFO] - Iteration 61 took 52s (9.45% Gen, 87.89% Train). Generation: 4s, Training: 46s. Estimated remaining time: 13h 36m 8s. Estimated total time: 14h 34m 36s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 27s, 500 more iterations: 7h 17m 18s. [2026-03-25 15:04:34,197][__main__][INFO] - Starting iteration 61. [2026-03-25 15:04:34,200][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:04:34,201][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:04:40,958][__main__][INFO] - Number of regex retries in iteration 61: 0 [2026-03-25 15:04:40,960][__main__][INFO] - agents played in iteration 61 are Bob, Alice [2026-03-25 15:04:41,529][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:04:41,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:04:41,595][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:04:41,595][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:04:42,441][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:04:43,064][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:04:43,725][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:04:44,386][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:04:45,046][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:04:45,705][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:04:46,366][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:04:47,026][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:04:47,685][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:04:48,345][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:04:49,005][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:04:49,666][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:04:50,324][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:04:50,984][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:04:51,644][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:04:52,303][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:04:52,963][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:04:53,623][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:04:54,283][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:04:54,943][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:04:55,602][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:04:56,262][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:04:56,921][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:04:57,580][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:04:58,240][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:04:58,900][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:04:59,559][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:05:00,219][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:05:00,879][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:05:01,538][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:05:02,198][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:05:02,856][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:05:03,516][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:05:04,175][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:05:04,836][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:05:05,495][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:05:06,154][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:05:06,813][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:05:07,472][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:05:08,131][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:05:08,790][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:05:09,449][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:05:10,109][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:05:10,769][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:05:11,428][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:05:12,087][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:05:12,746][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:05:13,405][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:05:14,386][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:05:15,047][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:05:15,705][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:05:16,363][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:05:17,022][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:05:17,680][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:05:18,340][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:05:18,998][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:05:19,657][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:05:20,317][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:05:20,977][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:05:21,634][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:05:22,293][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:05:22,952][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:05:23,612][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:05:24,271][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:05:24,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:05:25,656][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:05:27,139][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:05:27,145][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:05:27,146][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:05:28,569][__main__][INFO] - Iteration 62 took 54s (12.43% Gen, 84.95% Train). Generation: 6s, Training: 46s. Estimated remaining time: 14h 6m 46s. Estimated total time: 15h 6m 9s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 36s, 500 more iterations: 7h 33m 4s. [2026-03-25 15:05:28,571][__main__][INFO] - Starting iteration 62. [2026-03-25 15:05:28,574][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:05:28,575][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:05:33,190][__main__][INFO] - Number of regex retries in iteration 62: 0 [2026-03-25 15:05:33,191][__main__][INFO] - agents played in iteration 62 are Bob, Alice [2026-03-25 15:05:33,764][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:05:33,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:05:33,829][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:05:33,830][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:05:34,593][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:05:35,215][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:05:35,879][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:05:36,538][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:05:37,198][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:05:37,859][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:05:38,519][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:05:39,180][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:05:39,839][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:05:40,501][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:05:41,161][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:05:41,822][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:05:42,481][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:05:43,140][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:05:43,798][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:05:44,457][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:05:45,119][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:05:45,779][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:05:46,438][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:05:47,098][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:05:47,759][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:05:48,418][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:05:49,077][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:05:49,736][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:05:50,397][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:05:51,057][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:05:51,715][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:05:52,374][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:05:53,033][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:05:53,692][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:05:54,351][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:05:55,010][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:05:55,669][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:05:56,328][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:05:56,987][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:05:57,646][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:05:58,307][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:05:58,967][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:05:59,626][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:06:00,284][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:06:00,943][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:06:01,603][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:06:02,262][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:06:02,921][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:06:03,579][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:06:04,239][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:06:04,898][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:06:05,560][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:06:06,540][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:06:07,199][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:06:07,857][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:06:08,515][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:06:09,174][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:06:09,833][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:06:10,491][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:06:11,149][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:06:11,807][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:06:12,465][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:06:13,124][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:06:13,781][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:06:14,440][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:06:15,098][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:06:15,757][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:06:16,417][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:06:17,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:06:18,003][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:06:19,460][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:06:19,464][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:06:19,466][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:06:20,919][__main__][INFO] - Iteration 63 took 52s (8.82% Gen, 88.40% Train). Generation: 4s, Training: 46s. Estimated remaining time: 13h 32m 10s. Estimated total time: 14h 32m 26s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 14s, 500 more iterations: 7h 16m 13s. [2026-03-25 15:06:20,921][__main__][INFO] - Starting iteration 63. [2026-03-25 15:06:20,925][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:06:20,925][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:06:26,546][__main__][INFO] - Number of regex retries in iteration 63: 0 [2026-03-25 15:06:26,548][__main__][INFO] - agents played in iteration 63 are Bob, Alice [2026-03-25 15:06:27,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:06:27,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:06:27,478][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:06:27,478][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:06:28,130][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:06:28,748][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:06:29,408][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:06:30,067][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:06:30,726][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:06:31,385][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:06:32,045][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:06:32,708][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:06:33,369][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:06:34,028][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:06:34,689][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:06:35,349][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:06:36,008][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:06:36,667][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:06:37,326][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:06:37,986][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:06:38,646][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:06:39,305][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:06:39,964][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:06:40,622][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:06:41,281][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:06:41,940][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:06:42,600][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:06:43,259][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:06:43,918][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:06:44,576][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:06:45,235][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:06:45,894][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:06:46,554][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:06:47,214][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:06:47,873][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:06:48,532][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:06:49,191][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:06:49,851][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:06:50,511][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:06:51,169][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:06:51,828][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:06:52,488][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:06:53,147][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:06:53,806][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:06:54,467][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:06:55,125][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:06:55,784][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:06:56,445][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:06:57,106][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:06:57,765][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:06:58,425][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:06:59,085][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:07:00,091][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:07:00,750][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:07:01,411][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:07:02,069][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:07:02,728][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:07:03,388][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:07:04,046][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:07:04,704][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:07:05,363][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:07:06,027][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:07:06,685][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:07:07,346][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:07:08,005][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:07:08,664][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:07:09,323][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:07:09,982][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:07:10,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:07:11,424][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:07:12,807][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:07:12,810][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:07:12,821][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:07:14,336][__main__][INFO] - Iteration 64 took 53s (10.53% Gen, 86.63% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 49m 3s. Estimated total time: 14h 50m 12s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 1s, 500 more iterations: 7h 25m 6s. [2026-03-25 15:07:14,338][__main__][INFO] - Starting iteration 64. [2026-03-25 15:07:14,352][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:07:14,352][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:07:32,264][__main__][INFO] - Number of regex retries in iteration 64: 0 [2026-03-25 15:07:32,266][__main__][INFO] - agents played in iteration 64 are Bob, Alice [2026-03-25 15:07:33,284][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:07:33,352][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:07:33,353][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:07:33,353][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:07:34,039][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:07:34,656][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:07:35,316][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:07:35,975][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:07:36,636][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:07:37,294][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:07:37,955][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:07:38,614][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:07:39,274][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:07:39,933][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:07:40,592][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:07:41,250][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:07:41,909][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:07:42,568][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:07:43,227][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:07:43,886][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:07:44,545][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:07:45,204][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:07:45,864][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:07:46,523][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:07:47,182][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:07:47,841][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:07:48,500][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:07:49,160][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:07:49,820][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:07:50,483][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:07:51,142][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:07:51,800][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:07:52,460][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:07:53,121][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:07:53,781][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:07:54,440][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:07:55,100][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:07:55,759][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:07:56,420][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:07:57,079][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:07:57,738][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:07:58,398][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:07:59,059][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:07:59,718][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:08:00,377][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:08:01,038][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:08:01,697][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:08:02,357][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:08:03,016][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:08:03,676][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:08:04,335][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:08:04,994][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:08:05,985][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:08:06,645][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:08:07,303][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:08:07,961][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:08:08,619][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:08:09,279][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:08:09,938][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:08:10,599][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:08:11,258][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:08:11,916][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:08:12,575][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:08:13,234][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:08:13,892][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:08:14,551][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:08:15,211][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:08:15,869][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:08:16,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:08:17,298][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:08:18,636][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:08:18,640][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:08:18,641][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:08:20,118][__main__][INFO] - Iteration 65 took 1m 5s (27.23% Gen, 70.50% Train). Generation: 17s, Training: 46s. Estimated remaining time: 17h 14m 3s. Estimated total time: 18h 16m 18s. Time estimates for 10 more iterations: 10m 57s, 100 more iterations: 1h 49m 37s, 500 more iterations: 9h 8m 9s. [2026-03-25 15:08:20,121][__main__][INFO] - Starting iteration 65. [2026-03-25 15:08:20,133][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:08:20,134][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:08:26,228][__main__][INFO] - Number of regex retries in iteration 65: 0 [2026-03-25 15:08:26,230][__main__][INFO] - agents played in iteration 65 are Bob, Alice [2026-03-25 15:08:27,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:08:27,425][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:08:27,429][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:08:27,429][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:08:28,210][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:08:28,829][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:08:29,490][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:08:30,149][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:08:30,808][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:08:31,466][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:08:32,126][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:08:32,785][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:08:33,446][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:08:34,105][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:08:34,763][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:08:35,428][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:08:36,088][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:08:36,747][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:08:37,409][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:08:38,068][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:08:38,727][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:08:39,387][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:08:40,045][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:08:40,703][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:08:41,362][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:08:42,020][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:08:42,681][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:08:43,340][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:08:43,999][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:08:44,657][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:08:45,317][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:08:45,978][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:08:46,637][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:08:47,297][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:08:47,956][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:08:48,615][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:08:49,274][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:08:49,933][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:08:50,593][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:08:51,252][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:08:51,910][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:08:52,568][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:08:53,228][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:08:53,886][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:08:54,544][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:08:55,203][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:08:55,861][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:08:56,519][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:08:57,178][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:08:57,837][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:08:58,497][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:08:59,156][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:09:00,142][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:09:00,802][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:09:01,461][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:09:02,122][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:09:02,779][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:09:03,438][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:09:04,096][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:09:04,755][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:09:05,414][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:09:06,073][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:09:06,732][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:09:07,390][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:09:08,048][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:09:08,707][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:09:09,365][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:09:10,023][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:09:10,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:09:11,479][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:09:12,900][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:09:12,903][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:09:12,904][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:09:14,269][__main__][INFO] - Iteration 66 took 54s (11.26% Gen, 86.21% Train). Generation: 6s, Training: 46s. Estimated remaining time: 13h 59m 9s. Estimated total time: 15h 2m 18s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 13s, 500 more iterations: 7h 31m 9s. [2026-03-25 15:09:14,272][__main__][INFO] - Starting iteration 66. [2026-03-25 15:09:14,276][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:09:14,277][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:09:32,837][__main__][INFO] - Number of regex retries in iteration 66: 0 [2026-03-25 15:09:32,839][__main__][INFO] - agents played in iteration 66 are Bob, Alice [2026-03-25 15:09:33,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:09:33,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:09:33,957][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:09:33,957][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:09:34,805][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:09:35,408][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:09:36,067][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:09:36,724][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:09:37,384][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:09:38,043][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:09:38,702][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:09:39,360][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:09:40,019][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:09:40,676][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:09:41,337][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:09:41,995][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:09:42,652][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:09:43,310][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:09:43,971][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:09:44,629][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:09:45,287][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:09:45,945][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:09:46,603][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:09:47,260][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:09:47,921][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:09:48,577][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:09:49,235][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:09:49,895][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:09:50,554][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:09:51,214][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:09:51,872][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:09:52,530][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:09:53,189][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:09:53,849][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:09:54,507][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:09:55,166][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:09:55,824][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:09:56,483][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:09:57,142][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:09:57,800][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:09:58,459][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:09:59,118][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:09:59,776][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:10:00,436][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:10:01,095][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:10:01,752][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:10:02,411][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:10:03,068][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:10:03,727][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:10:04,385][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:10:05,043][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:10:05,701][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:10:06,691][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:10:07,349][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:10:08,008][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:10:08,667][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:10:09,326][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:10:09,985][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:10:10,645][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:10:11,304][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:10:11,963][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:10:12,621][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:10:13,279][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:10:13,937][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:10:14,596][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:10:15,256][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:10:15,916][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:10:16,576][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:10:17,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:10:18,003][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:10:19,922][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:10:19,926][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:10:19,927][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:10:21,260][__main__][INFO] - Iteration 67 took 1m 6s (27.71% Gen, 70.30% Train). Generation: 18s, Training: 47s. Estimated remaining time: 17h 32m 8s. Estimated total time: 18h 36m 24s. Time estimates for 10 more iterations: 11m 9s, 100 more iterations: 1h 51m 38s, 500 more iterations: 9h 18m 12s. [2026-03-25 15:10:21,262][__main__][INFO] - Starting iteration 67. [2026-03-25 15:10:21,267][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:10:21,268][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:10:28,771][__main__][INFO] - Number of regex retries in iteration 67: 0 [2026-03-25 15:10:28,772][__main__][INFO] - agents played in iteration 67 are Bob, Alice [2026-03-25 15:10:29,273][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:10:29,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:10:29,340][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:10:29,340][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:10:30,190][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:10:30,815][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:10:31,478][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:10:32,147][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:10:32,810][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:10:33,473][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:10:34,132][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:10:34,801][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:10:35,461][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:10:36,121][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:10:36,782][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:10:37,441][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:10:38,103][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:10:38,762][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:10:39,421][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:10:40,080][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:10:40,739][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:10:41,398][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:10:42,059][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:10:42,719][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:10:43,377][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:10:44,036][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:10:44,694][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:10:45,352][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:10:46,011][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:10:46,670][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:10:47,330][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:10:47,989][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:10:48,651][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:10:49,310][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:10:49,970][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:10:50,630][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:10:51,289][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:10:51,947][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:10:52,606][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:10:53,265][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:10:53,925][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:10:54,584][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:10:55,243][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:10:55,901][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:10:56,561][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:10:57,220][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:10:57,880][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:10:58,540][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:10:59,199][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:10:59,860][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:11:00,520][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:11:01,180][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:11:02,161][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:11:02,821][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:11:03,481][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:11:04,141][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:11:04,800][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:11:05,460][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:11:06,120][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:11:06,778][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:11:07,438][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:11:08,098][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:11:08,758][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:11:09,416][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:11:10,074][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:11:10,732][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:11:11,392][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:11:12,049][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:11:12,708][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:11:13,481][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:11:15,352][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:11:15,355][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:11:15,356][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:11:17,002][__main__][INFO] - Iteration 68 took 55s (13.46% Gen, 83.58% Train). Generation: 7s, Training: 46s. Estimated remaining time: 14h 23m 44s. Estimated total time: 15h 28m 56s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 53s, 500 more iterations: 7h 44m 28s. [2026-03-25 15:11:17,004][__main__][INFO] - Starting iteration 68. [2026-03-25 15:11:17,009][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:11:17,010][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:11:22,159][__main__][INFO] - Number of regex retries in iteration 68: 0 [2026-03-25 15:11:22,160][__main__][INFO] - agents played in iteration 68 are Bob, Alice [2026-03-25 15:11:22,730][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:11:22,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:11:22,796][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:11:22,797][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:11:23,467][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:11:24,072][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:11:24,737][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:11:25,397][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:11:26,056][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:11:26,718][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:11:27,378][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:11:28,037][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:11:28,700][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:11:29,360][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:11:30,018][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:11:30,678][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:11:31,340][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:11:31,999][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:11:32,658][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:11:33,317][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:11:33,977][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:11:34,636][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:11:35,297][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:11:35,956][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:11:36,615][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:11:37,274][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:11:37,933][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:11:38,592][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:11:39,256][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:11:39,911][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:11:40,570][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:11:41,229][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:11:41,889][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:11:42,548][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:11:43,207][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:11:43,865][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:11:44,524][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:11:45,183][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:11:45,841][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:11:46,500][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:11:47,158][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:11:47,816][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:11:48,475][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:11:49,134][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:11:49,796][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:11:50,455][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:11:51,113][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:11:51,772][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:11:52,433][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:11:53,092][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:11:53,750][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:11:54,410][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:11:55,429][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:11:56,093][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:11:56,752][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:11:57,411][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:11:58,070][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:11:58,738][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:11:59,398][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:12:00,056][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:12:00,713][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:12:01,372][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:12:02,031][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:12:02,689][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:12:03,348][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:12:04,006][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:12:04,664][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:12:05,322][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:12:05,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:12:06,773][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:12:08,243][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:12:08,245][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:12:08,247][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:12:09,849][__main__][INFO] - Iteration 69 took 52s (9.75% Gen, 87.21% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 34m 38s. Estimated total time: 14h 40m 43s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 4s, 500 more iterations: 7h 20m 21s. [2026-03-25 15:12:09,861][__main__][INFO] - Starting iteration 69. [2026-03-25 15:12:09,879][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:12:09,880][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:12:15,513][__main__][INFO] - Number of regex retries in iteration 69: 0 [2026-03-25 15:12:15,513][__main__][INFO] - agents played in iteration 69 are Bob, Alice [2026-03-25 15:12:16,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:12:16,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:12:16,424][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:12:16,424][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:12:17,080][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:12:17,717][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:12:18,369][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:12:19,028][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:12:19,689][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:12:20,348][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:12:21,009][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:12:21,670][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:12:22,330][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:12:22,991][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:12:23,650][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:12:24,311][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:12:24,971][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:12:25,632][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:12:26,292][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:12:26,951][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:12:27,610][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:12:28,269][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:12:28,929][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:12:29,589][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:12:30,249][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:12:30,909][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:12:31,569][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:12:32,231][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:12:32,890][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:12:33,548][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:12:34,208][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:12:34,868][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:12:35,526][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:12:36,185][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:12:36,843][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:12:37,504][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:12:38,164][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:12:38,825][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:12:39,485][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:12:40,144][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:12:40,804][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:12:41,462][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:12:42,121][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:12:42,780][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:12:43,439][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:12:44,096][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:12:44,755][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:12:45,415][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:12:46,076][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:12:46,735][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:12:47,394][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:12:48,053][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:12:49,053][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:12:49,711][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:12:50,370][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:12:51,028][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:12:51,685][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:12:52,343][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:12:53,002][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:12:53,660][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:12:54,318][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:12:54,977][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:12:55,635][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:12:56,293][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:12:56,951][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:12:57,613][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:12:58,275][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:12:58,936][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:12:59,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:13:02,844][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:45 [2026-03-25 15:13:04,572][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:13:04,575][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:13:04,576][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:13:06,153][__main__][INFO] - Iteration 70 took 56s (10.01% Gen, 87.18% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 30m 55s. Estimated total time: 15h 37m 55s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 47s, 500 more iterations: 7h 48m 57s. [2026-03-25 15:13:06,155][__main__][INFO] - Starting iteration 70. [2026-03-25 15:13:06,162][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:13:06,162][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:13:06,879][mllm.models.large_language_model_local][WARNING] - Response >B did not match regex: (|), retry 1/1 [2026-03-25 15:13:13,091][__main__][INFO] - Number of regex retries in iteration 70: 1 [2026-03-25 15:13:13,093][__main__][INFO] - agents played in iteration 70 are Bob, Alice [2026-03-25 15:13:14,121][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:13:14,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:13:14,184][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:13:14,184][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2026-03-25 15:13:14,906][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:13:15,519][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:13:16,181][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:13:16,841][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:13:17,499][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:13:18,158][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:13:18,818][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:13:19,481][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:13:20,141][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:13:20,798][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:13:21,455][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:13:22,115][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:13:22,774][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:13:23,434][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:13:24,094][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:13:24,753][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:13:25,413][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:13:26,071][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:13:26,732][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:13:27,390][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:13:28,050][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 256 [2026-03-25 15:13:28,708][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:13:29,367][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:13:30,026][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:13:30,684][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:13:31,343][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:13:32,001][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:13:32,661][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:13:33,323][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:13:33,981][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:13:34,640][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:13:35,299][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:13:35,957][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:13:36,614][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:13:37,273][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:13:37,931][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:13:38,588][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:13:39,246][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:13:39,904][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:13:40,563][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:13:41,220][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 
15:13:41,879][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:13:42,537][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:13:43,195][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:13:43,854][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:13:44,514][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:13:45,173][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:13:45,832][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:13:46,811][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:13:47,472][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:13:48,131][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:13:49,104][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:13:49,763][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:13:50,423][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:13:51,081][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:13:51,741][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:13:52,400][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:13:53,059][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:13:53,718][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:13:54,376][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:13:55,036][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:13:55,696][mllm.training.trainer_common][INFO] - 
Processing mini-batch 244 of 256 [2026-03-25 15:13:56,356][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:13:57,015][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:13:57,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:13:58,561][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:13:59,956][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:13:59,958][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:13:59,960][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:14:01,287][__main__][INFO] - Iteration 71 took 55s (12.57% Gen, 85.02% Train). Generation: 6s, Training: 46s. Estimated remaining time: 14h 10m 51s. Estimated total time: 15h 18m 47s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 52s, 500 more iterations: 7h 39m 23s. [2026-03-25 15:14:01,290][__main__][INFO] - Starting iteration 71. [2026-03-25 15:14:01,301][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:14:01,302][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:14:07,628][__main__][INFO] - Number of regex retries in iteration 71: 0 [2026-03-25 15:14:07,629][__main__][INFO] - agents played in iteration 71 are Bob, Alice [2026-03-25 15:14:08,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:14:08,236][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:14:08,237][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:14:08,237][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:14:09,116][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:14:09,732][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:14:10,392][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:14:11,051][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:14:11,709][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:14:12,368][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:14:13,028][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:14:13,690][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:14:14,353][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:14:15,014][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:14:15,676][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:14:16,336][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:14:16,996][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:14:17,655][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:14:18,314][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:14:18,974][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:14:19,633][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:14:20,293][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:14:20,952][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:14:21,611][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:14:22,271][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:14:22,930][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:14:23,588][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:14:24,247][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:14:24,906][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:14:25,566][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:14:26,225][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:14:26,883][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:14:27,543][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:14:28,203][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:14:28,863][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:14:29,522][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:14:30,182][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:14:30,843][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:14:31,502][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:14:32,161][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:14:32,820][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:14:33,480][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:14:34,140][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:14:34,801][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:14:35,459][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:14:36,120][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:14:36,783][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:14:37,443][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:14:38,104][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:14:38,764][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:14:39,426][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:14:40,090][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:14:41,090][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:14:41,750][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:14:42,408][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:14:43,066][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:14:43,724][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:14:44,384][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:14:45,042][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:14:45,700][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:14:46,358][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:14:47,017][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:14:47,676][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:14:48,335][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:14:48,994][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:14:49,652][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:14:50,309][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:14:50,968][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:14:51,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:14:52,402][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:14:53,931][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:14:53,934][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:14:53,935][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:14:55,727][__main__][INFO] - Iteration 72 took 54s (11.63% Gen, 85.08% Train). Generation: 6s, Training: 46s. Estimated remaining time: 13h 58m 17s. Estimated total time: 15h 7m 7s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 42s, 500 more iterations: 7h 33m 33s. [2026-03-25 15:14:55,729][__main__][INFO] - Starting iteration 72. [2026-03-25 15:14:55,734][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:14:55,735][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:15:00,689][__main__][INFO] - Number of regex retries in iteration 72: 0 [2026-03-25 15:15:00,689][__main__][INFO] - agents played in iteration 72 are Bob, Alice [2026-03-25 15:15:01,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:15:01,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:15:01,244][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:15:01,245][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:15:02,128][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:15:02,756][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:15:03,404][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:15:04,067][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:15:04,726][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:15:05,385][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:15:06,045][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:15:06,704][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:15:07,363][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:15:08,022][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:15:08,682][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:15:09,341][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:15:10,000][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:15:10,659][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:15:11,319][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:15:11,978][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:15:12,637][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:15:13,295][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:15:13,956][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:15:14,616][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:15:15,275][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:15:15,933][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:15:16,591][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:15:17,250][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:15:17,908][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:15:18,568][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:15:19,226][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:15:19,884][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:15:20,544][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:15:21,203][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:15:21,861][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:15:22,520][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:15:23,178][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:15:23,837][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:15:24,495][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:15:25,154][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:15:25,812][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:15:26,473][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:15:27,130][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:15:27,789][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:15:28,449][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:15:29,108][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:15:29,766][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:15:30,426][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:15:31,087][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:15:31,748][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:15:32,407][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:15:33,067][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:15:34,053][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:15:34,713][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:15:35,371][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:15:36,030][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:15:36,689][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:15:37,346][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:15:38,006][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:15:38,664][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:15:39,322][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:15:39,980][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:15:40,640][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:15:41,298][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:15:41,956][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:15:42,615][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:15:43,273][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:15:43,932][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:15:44,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:15:45,358][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:15:46,689][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:15:46,691][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:15:46,692][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:15:48,086][__main__][INFO] - Iteration 73 took 52s (9.46% Gen, 87.87% Train). Generation: 4s, Training: 46s. Estimated remaining time: 13h 22m 52s. Estimated total time: 14h 32m 34s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 15s, 500 more iterations: 7h 16m 17s. [2026-03-25 15:15:48,088][__main__][INFO] - Starting iteration 73. [2026-03-25 15:15:48,093][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:15:48,093][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:15:54,379][__main__][INFO] - Number of regex retries in iteration 73: 0 [2026-03-25 15:15:54,379][__main__][INFO] - agents played in iteration 73 are Bob, Alice [2026-03-25 15:15:54,953][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:15:55,014][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:15:55,015][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:15:55,015][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:15:55,670][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:15:56,275][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:15:56,936][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:15:57,595][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:15:58,255][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:15:58,914][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:15:59,574][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:16:00,233][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:16:00,891][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:16:01,551][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:16:02,211][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:16:02,871][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:16:03,529][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:16:04,189][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:16:04,847][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:16:05,506][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:16:06,166][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:16:06,826][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:16:07,484][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:16:08,144][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:16:08,803][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:16:09,469][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:16:10,128][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:16:10,787][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:16:11,446][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:16:12,105][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:16:12,763][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:16:13,426][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:16:14,084][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:16:14,747][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:16:15,407][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:16:16,067][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:16:16,727][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:16:17,390][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:16:18,054][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:16:18,716][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:16:19,379][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:16:20,044][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:16:20,712][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:16:21,373][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:16:22,032][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:16:22,692][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:16:23,352][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:16:24,011][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:16:24,672][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:16:25,329][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:16:25,991][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:16:26,650][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:16:27,637][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:16:28,298][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:16:28,959][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:16:29,618][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:16:30,278][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:16:30,935][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:16:31,593][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:16:32,252][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:16:32,910][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:16:33,571][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:16:34,232][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:16:34,888][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:16:35,549][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:16:36,207][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:16:36,866][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:16:37,526][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:16:38,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:16:38,966][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:16:40,378][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:16:40,381][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:16:40,383][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:16:42,113][__main__][INFO] - Iteration 74 took 54s (11.64% Gen, 85.16% Train). Generation: 6s, Training: 46s. Estimated remaining time: 13h 49m 45s. Estimated total time: 15h 0m 22s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 2s, 500 more iterations: 7h 30m 11s. [2026-03-25 15:16:42,115][__main__][INFO] - Starting iteration 74. [2026-03-25 15:16:42,120][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:16:42,121][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:16:47,035][__main__][INFO] - Number of regex retries in iteration 74: 0 [2026-03-25 15:16:47,036][__main__][INFO] - agents played in iteration 74 are Bob, Alice [2026-03-25 15:16:47,552][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:16:47,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:16:47,618][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:16:47,618][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:16:48,268][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:16:48,880][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:16:49,539][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:16:50,199][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:16:50,860][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:16:51,518][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:16:52,176][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:16:52,836][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:16:53,495][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:16:54,155][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:16:54,813][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:16:55,472][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:16:56,130][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:16:56,789][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:16:57,447][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:16:58,107][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:16:58,767][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:16:59,488][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:17:00,150][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:17:00,808][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:17:01,467][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:17:02,133][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:17:02,792][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:17:03,453][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:17:04,115][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:17:04,774][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:17:05,433][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:17:06,091][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:17:06,750][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:17:07,425][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:17:08,084][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:17:08,745][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:17:09,404][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:17:10,061][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:17:10,719][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:17:11,383][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:17:12,043][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:17:12,702][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:17:13,361][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:17:14,019][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:17:14,678][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:17:15,335][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:17:15,995][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:17:16,653][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:17:17,311][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:17:17,970][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:17:18,627][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:17:19,286][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:17:20,256][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:17:20,917][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:17:21,574][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:17:22,232][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:17:22,890][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:17:23,550][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:17:24,207][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:17:24,865][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:17:25,525][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:17:26,184][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:17:26,842][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:17:27,500][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:17:28,159][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:17:28,818][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:17:29,477][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:17:30,138][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:17:30,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:17:31,587][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:17:33,131][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:17:33,133][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:17:33,134][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:17:34,582][__main__][INFO] - Iteration 75 took 52s (9.37% Gen, 87.87% Train). Generation: 4s, Training: 46s. Estimated remaining time: 13h 22m 54s. Estimated total time: 14h 34m 23s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 26s, 500 more iterations: 7h 17m 11s. [2026-03-25 15:17:34,586][__main__][INFO] - Starting iteration 75. [2026-03-25 15:17:34,594][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:17:34,594][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:17:42,328][__main__][INFO] - Number of regex retries in iteration 75: 0 [2026-03-25 15:17:42,329][__main__][INFO] - agents played in iteration 75 are Bob, Alice [2026-03-25 15:17:43,186][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:17:43,247][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:17:43,247][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:17:43,248][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:17:44,137][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:17:44,767][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:17:45,427][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:17:46,086][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:17:46,744][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:17:47,403][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:17:48,062][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:17:48,723][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:17:49,381][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:17:50,041][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:17:50,703][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:17:51,366][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:17:52,021][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:17:52,683][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:17:53,341][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:17:54,002][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:17:54,661][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:17:55,321][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:17:55,981][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:17:56,641][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:17:57,299][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:17:57,958][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:17:58,618][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:17:59,277][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:17:59,936][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:18:00,597][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:18:01,256][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:18:01,915][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:18:02,574][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:18:03,234][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:18:03,893][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:18:04,553][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:18:05,212][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:18:05,872][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:18:06,531][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:18:07,191][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:18:07,849][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:18:08,507][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:18:09,167][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:18:09,828][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:18:10,486][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:18:11,143][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:18:11,802][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:18:12,461][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:18:13,120][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:18:13,780][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:18:14,440][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:18:15,099][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:18:16,074][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:18:16,733][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:18:17,391][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:18:18,050][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:18:18,710][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:18:19,368][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:18:20,026][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:18:20,685][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:18:21,343][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:18:22,002][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:18:22,660][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:18:23,319][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:18:23,977][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:18:24,635][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:18:25,295][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:18:25,953][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:18:26,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:18:27,473][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:18:28,993][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:18:28,996][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:18:28,997][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:18:30,392][__main__][INFO] - Iteration 76 took 55s (13.86% Gen, 83.63% Train). Generation: 7s, Training: 46s. Estimated remaining time: 14h 17m 35s. Estimated total time: 15h 30m 0s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 0s, 500 more iterations: 7h 45m 0s. [2026-03-25 15:18:30,394][__main__][INFO] - Starting iteration 76. [2026-03-25 15:18:30,399][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:18:30,399][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:18:36,999][__main__][INFO] - Number of regex retries in iteration 76: 0 [2026-03-25 15:18:37,000][__main__][INFO] - agents played in iteration 76 are Bob, Alice [2026-03-25 15:18:37,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:18:37,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:18:37,821][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:18:37,821][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:18:38,593][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:18:39,205][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:18:39,866][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:18:40,526][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:18:41,186][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:18:41,844][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:18:42,504][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:18:43,163][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:18:43,823][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:18:44,483][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:18:45,143][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:18:45,802][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:18:46,463][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:18:47,122][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:18:47,782][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:18:48,441][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:18:49,100][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:18:49,760][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:18:50,421][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:18:51,084][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:18:51,744][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:18:52,407][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:18:53,068][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:18:53,726][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:18:54,387][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:18:55,046][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:18:55,705][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:18:56,365][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:18:57,024][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:18:57,683][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:18:58,341][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:18:59,000][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:18:59,660][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:19:00,319][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:19:00,978][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:19:01,640][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:19:02,299][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:19:02,958][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:19:03,616][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:19:04,275][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:19:04,933][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:19:05,593][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:19:06,252][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:19:06,910][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:19:07,569][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:19:08,228][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:19:08,887][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:19:09,547][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:19:10,527][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:19:11,188][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:19:11,846][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:19:12,504][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:19:13,163][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:19:13,823][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:19:14,481][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:19:15,141][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:19:15,801][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:19:16,461][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:19:17,120][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:19:17,778][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:19:18,437][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:19:19,096][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:19:19,755][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:19:20,416][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:19:21,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:19:21,850][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:19:23,211][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:19:23,213][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:19:23,215][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:19:24,650][__main__][INFO] - Iteration 77 took 54s (12.17% Gen, 85.18% Train). Generation: 6s, Training: 46s. Estimated remaining time: 13h 50m 55s. Estimated total time: 15h 4m 14s. Time estimates for 10 more iterations: 9m 2s, 100 more iterations: 1h 30m 25s, 500 more iterations: 7h 32m 7s. [2026-03-25 15:19:24,653][__main__][INFO] - Starting iteration 77. [2026-03-25 15:19:24,658][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:19:24,658][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:19:30,483][__main__][INFO] - Number of regex retries in iteration 77: 0 [2026-03-25 15:19:30,484][__main__][INFO] - agents played in iteration 77 are Bob, Alice [2026-03-25 15:19:31,395][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:19:31,463][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:19:31,463][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:19:31,464][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:19:32,208][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:19:32,821][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:19:33,483][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:19:34,143][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:19:34,804][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:19:35,464][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:19:36,124][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:19:36,784][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:19:37,445][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:19:38,104][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:19:38,764][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:19:39,422][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:19:40,082][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:19:40,742][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:19:41,402][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:19:42,061][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:19:42,720][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:19:43,380][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:19:44,038][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:19:44,697][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:19:45,360][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:19:46,018][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:19:46,678][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:19:47,339][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:19:47,999][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:19:48,658][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:19:49,319][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:19:49,978][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:19:50,638][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:19:51,298][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:19:51,957][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:19:52,616][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:19:53,275][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:19:53,935][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:19:54,595][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:19:55,253][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:19:55,912][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:19:56,572][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:19:57,231][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:19:57,890][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:19:58,549][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:19:59,208][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:19:59,866][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:20:00,525][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:20:01,184][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:20:01,842][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:20:02,502][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:20:03,161][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:20:04,148][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:20:04,806][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:20:05,464][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:20:06,122][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:20:06,779][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:20:07,438][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:20:08,097][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:20:08,755][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:20:09,413][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:20:10,072][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:20:10,730][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:20:11,389][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:20:12,047][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:20:12,705][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:20:13,363][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:20:14,022][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:20:14,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:20:15,417][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:20:16,768][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:20:16,771][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:20:16,772][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:20:18,404][__main__][INFO] - Iteration 78 took 53s (10.78% Gen, 86.12% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 41m 35s. Estimated total time: 14h 55m 48s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 34s, 500 more iterations: 7h 27m 54s. [2026-03-25 15:20:18,406][__main__][INFO] - Starting iteration 78. [2026-03-25 15:20:18,412][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:20:18,412][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:20:24,148][__main__][INFO] - Number of regex retries in iteration 78: 0 [2026-03-25 15:20:24,148][__main__][INFO] - agents played in iteration 78 are Bob, Alice [2026-03-25 15:20:24,930][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:20:24,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:20:24,992][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:20:24,992][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:20:25,855][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:20:26,496][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:20:27,155][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:20:27,815][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:20:28,474][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:20:29,134][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:20:29,792][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:20:30,450][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:20:31,108][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:20:31,765][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:20:32,423][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:20:33,081][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:20:33,740][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:20:34,398][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:20:35,057][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:20:35,715][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:20:36,373][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:20:37,031][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:20:37,690][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:20:38,348][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:20:39,005][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:20:39,663][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:20:40,321][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:20:40,979][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:20:41,636][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:20:42,295][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:20:42,953][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:20:43,611][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:20:44,268][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:20:44,926][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:20:45,583][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:20:46,242][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:20:46,900][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:20:47,558][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:20:48,217][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:20:48,875][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:20:49,533][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:20:50,190][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:20:50,848][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:20:51,507][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:20:52,164][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:20:52,822][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:20:53,480][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:20:54,137][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:20:54,796][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:20:55,454][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:20:56,113][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:20:56,771][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:20:57,758][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:20:58,418][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:20:59,080][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:20:59,738][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:21:00,398][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:21:01,057][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:21:01,715][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:21:02,375][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:21:03,034][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:21:03,692][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:21:04,350][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:21:05,008][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:21:05,667][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:21:06,326][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:21:06,984][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:21:07,643][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:21:08,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:21:09,073][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:21:10,438][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:21:10,441][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:21:10,442][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:21:11,986][__main__][INFO] - Iteration 79 took 53s (10.71% Gen, 86.41% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 37m 50s. Estimated total time: 14h 52m 56s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 17s, 500 more iterations: 7h 26m 28s. [2026-03-25 15:21:11,988][__main__][INFO] - Starting iteration 79. [2026-03-25 15:21:11,992][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:21:11,993][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:21:16,885][__main__][INFO] - Number of regex retries in iteration 79: 0 [2026-03-25 15:21:16,886][__main__][INFO] - agents played in iteration 79 are Bob, Alice [2026-03-25 15:21:17,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:21:17,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:21:17,467][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:21:17,468][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:21:18,155][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:21:18,781][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:21:19,442][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:21:20,102][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:21:20,762][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:21:21,420][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:21:22,081][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:21:22,741][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:21:23,399][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:21:24,058][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:21:24,716][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:21:25,377][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:21:26,036][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:21:26,695][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:21:27,354][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:21:28,014][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:21:28,673][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:21:29,335][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:21:29,994][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:21:30,653][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:21:31,314][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:21:31,974][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:21:32,633][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:21:33,293][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:21:33,951][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:21:34,610][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:21:35,271][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:21:35,930][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:21:36,588][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:21:37,248][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:21:37,907][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:21:38,566][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:21:39,225][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:21:39,885][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:21:40,545][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:21:41,203][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:21:41,862][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:21:42,522][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:21:43,181][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:21:43,841][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:21:44,500][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:21:45,160][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:21:45,820][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:21:46,479][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:21:47,139][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:21:47,800][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:21:48,461][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:21:49,122][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:21:50,100][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:21:50,761][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:21:51,421][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:21:52,080][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:21:52,739][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:21:53,399][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:21:54,056][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:21:54,715][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:21:55,375][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:21:56,033][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:21:56,691][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:21:57,349][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:21:58,006][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:21:58,665][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:21:59,323][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:21:59,981][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:22:00,639][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:22:01,517][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:22:03,316][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:22:03,319][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:22:03,321][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:22:04,965][__main__][INFO] - Iteration 80 took 52s (9.24% Gen, 87.65% Train). Generation: 4s, Training: 46s. Estimated remaining time: 13h 26m 55s. Estimated total time: 14h 42m 54s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 17s, 500 more iterations: 7h 21m 27s. [2026-03-25 15:22:04,967][__main__][INFO] - Starting iteration 80. [2026-03-25 15:22:04,971][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:22:04,972][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:22:11,417][__main__][INFO] - Number of regex retries in iteration 80: 0 [2026-03-25 15:22:11,418][__main__][INFO] - agents played in iteration 80 are Bob, Alice [2026-03-25 15:22:11,992][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:22:12,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:22:12,054][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:22:12,054][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:22:12,929][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:22:13,540][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:22:14,201][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:22:14,860][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:22:15,520][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:22:16,179][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:22:16,838][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:22:17,498][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:22:18,156][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:22:18,814][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:22:19,473][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:22:20,131][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:22:20,790][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:22:21,449][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:22:22,108][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:22:22,767][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:22:23,426][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:22:24,086][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:22:24,745][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:22:25,403][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:22:26,062][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:22:26,722][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:22:27,381][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:22:28,039][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:22:28,698][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:22:29,357][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:22:30,016][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:22:30,676][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:22:31,335][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:22:31,994][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:22:32,653][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:22:33,312][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:22:33,972][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:22:34,632][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:22:35,291][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:22:35,950][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:22:36,609][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:22:37,268][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:22:37,927][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:22:38,592][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:22:39,250][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:22:39,908][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:22:40,568][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:22:41,228][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:22:41,888][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:22:42,547][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:22:43,206][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:22:43,865][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:22:44,893][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:22:45,552][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:22:46,210][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:22:46,870][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:22:47,529][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:22:48,186][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:22:48,845][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:22:49,503][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:22:50,161][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:22:50,822][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:22:51,478][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:22:52,139][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:22:52,797][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:22:53,456][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:22:54,115][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:22:54,774][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:22:55,432][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:22:56,252][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:22:58,141][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:22:58,144][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:22:58,146][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:22:59,678][__main__][INFO] - Iteration 81 took 54s (11.78% Gen, 85.41% Train). Generation: 6s, Training: 46s. Estimated remaining time: 13h 54m 54s. Estimated total time: 15h 11m 48s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 10s, 500 more iterations: 7h 35m 54s. [2026-03-25 15:22:59,681][__main__][INFO] - Starting iteration 81. [2026-03-25 15:22:59,686][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:22:59,687][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:23:04,990][__main__][INFO] - Number of regex retries in iteration 81: 0 [2026-03-25 15:23:04,991][__main__][INFO] - agents played in iteration 81 are Bob, Alice [2026-03-25 15:23:05,447][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:23:05,508][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:23:05,509][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:23:05,509][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:23:06,214][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:23:06,822][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:23:07,481][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:23:08,139][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:23:08,800][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:23:09,456][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:23:10,117][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:23:10,776][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:23:11,436][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:23:12,095][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:23:12,753][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:23:13,413][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:23:14,073][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:23:14,731][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:23:15,390][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:23:16,047][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:23:16,705][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:23:17,365][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:23:18,023][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:23:18,681][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:23:19,340][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:23:19,998][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:23:20,656][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:23:21,314][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:23:21,974][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:23:22,633][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:23:23,291][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:23:23,951][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:23:24,610][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:23:25,267][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:23:25,925][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:23:26,584][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:23:27,241][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:23:27,899][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:23:28,558][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:23:29,217][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:23:29,875][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:23:30,533][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:23:31,191][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:23:31,851][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:23:32,509][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:23:33,167][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:23:33,826][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:23:34,484][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:23:35,142][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:23:35,801][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:23:36,458][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:23:37,117][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:23:38,091][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:23:38,751][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:23:39,410][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:23:40,068][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:23:40,726][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:23:41,384][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:23:42,043][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:23:42,702][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:23:43,359][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:23:44,018][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:23:44,676][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:23:45,334][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:23:45,993][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:23:46,652][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:23:47,309][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:23:47,968][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:23:48,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:23:49,411][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:23:50,837][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:23:50,840][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:23:50,841][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:23:52,401][__main__][INFO] - Iteration 82 took 52s (10.06% Gen, 86.97% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 20m 50s. Estimated total time: 14h 38m 37s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 51s, 500 more iterations: 7h 19m 18s. [2026-03-25 15:23:52,403][__main__][INFO] - Starting iteration 82. [2026-03-25 15:23:52,407][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:23:52,408][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:23:57,408][__main__][INFO] - Number of regex retries in iteration 82: 0 [2026-03-25 15:23:57,409][__main__][INFO] - agents played in iteration 82 are Bob, Alice [2026-03-25 15:23:57,987][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:23:58,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:23:58,050][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:23:58,050][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:23:58,953][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:23:59,565][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:24:00,225][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:24:00,884][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:24:01,542][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:24:02,201][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:24:02,860][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:24:03,518][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:24:04,176][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:24:04,835][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:24:05,494][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:24:06,152][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:24:06,810][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:24:07,468][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:24:08,126][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:24:08,783][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:24:09,441][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:24:10,099][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:24:10,757][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:24:11,416][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:24:12,073][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:24:12,731][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:24:13,389][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:24:14,046][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:24:14,705][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:24:15,363][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:24:16,022][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:24:16,680][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:24:17,338][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:24:17,996][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:24:18,653][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:24:19,311][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:24:19,968][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:24:20,627][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:24:21,285][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:24:21,942][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:24:22,601][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:24:23,260][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:24:23,918][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:24:24,577][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:24:25,236][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:24:25,893][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:24:26,552][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:24:27,210][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:24:27,868][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:24:28,526][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:24:29,185][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:24:29,843][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:24:30,818][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:24:31,479][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:24:32,136][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:24:32,794][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:24:33,453][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:24:34,110][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:24:34,768][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:24:35,427][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:24:36,085][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:24:36,743][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:24:37,401][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:24:38,061][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:24:38,718][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:24:39,379][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:24:40,036][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:24:40,695][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:24:41,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:24:42,067][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:24:43,234][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:24:43,236][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:24:43,237][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:24:44,913][__main__][INFO] - Iteration 83 took 52s (9.52% Gen, 87.28% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 16m 28s. Estimated total time: 14h 35m 8s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 30s, 500 more iterations: 7h 17m 34s. [2026-03-25 15:24:44,915][__main__][INFO] - Starting iteration 83. [2026-03-25 15:24:44,920][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:24:44,921][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:24:51,180][__main__][INFO] - Number of regex retries in iteration 83: 0 [2026-03-25 15:24:51,181][__main__][INFO] - agents played in iteration 83 are Bob, Alice [2026-03-25 15:24:51,979][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:24:52,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:24:52,040][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:24:52,041][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:24:52,867][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:24:53,484][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:24:54,142][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:24:54,797][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:24:55,457][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:24:56,116][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:24:56,774][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:24:57,434][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:24:58,093][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:24:58,752][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:24:59,409][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:25:00,067][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:25:00,724][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:25:01,384][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:25:02,043][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:25:02,702][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:25:03,360][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:25:04,018][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:25:04,676][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:25:05,334][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:25:05,992][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:25:06,651][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:25:07,309][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:25:07,967][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:25:08,625][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:25:09,283][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:25:09,941][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:25:10,599][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:25:11,258][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:25:11,915][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:25:12,572][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:25:13,231][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:25:13,890][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:25:14,547][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:25:15,205][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:25:15,863][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:25:16,521][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:25:17,180][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:25:17,837][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:25:18,495][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:25:19,153][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:25:19,810][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:25:20,469][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:25:21,127][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:25:21,785][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:25:22,443][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:25:23,102][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:25:23,760][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:25:24,739][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:25:25,397][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:25:26,055][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:25:26,713][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:25:27,372][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:25:28,030][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:25:28,688][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:25:29,347][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:25:30,004][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:25:30,664][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:25:31,323][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:25:31,981][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:25:32,641][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:25:33,298][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:25:33,958][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:25:34,615][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:25:35,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:25:36,083][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:25:38,019][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:25:38,022][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:25:38,023][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:25:39,506][__main__][INFO] - Iteration 84 took 54s (11.47% Gen, 85.81% Train). Generation: 6s, Training: 46s. Estimated remaining time: 13h 50m 13s. Estimated total time: 15h 9m 48s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 58s, 500 more iterations: 7h 34m 54s. [2026-03-25 15:25:39,508][__main__][INFO] - Starting iteration 84. [2026-03-25 15:25:39,513][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:25:39,513][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:25:45,290][__main__][INFO] - Number of regex retries in iteration 84: 0 [2026-03-25 15:25:45,291][__main__][INFO] - agents played in iteration 84 are Bob, Alice [2026-03-25 15:25:45,902][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:25:45,968][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:25:45,969][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:25:45,969][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:25:46,619][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:25:47,229][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:25:47,889][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:25:48,549][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:25:49,207][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:25:49,866][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:25:50,524][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:25:51,183][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:25:51,840][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:25:52,499][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:25:53,157][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:25:53,816][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:25:54,472][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:25:55,131][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:25:55,788][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:25:56,445][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:25:57,103][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:25:57,761][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:25:58,420][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:25:59,078][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:25:59,736][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:26:00,393][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:26:01,052][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:26:01,710][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:26:02,367][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:26:03,026][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:26:03,684][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:26:04,342][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:26:04,999][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:26:05,656][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:26:06,313][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:26:06,972][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:26:07,631][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:26:08,290][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:26:08,948][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:26:09,607][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:26:10,265][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:26:10,924][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:26:11,582][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:26:12,240][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:26:12,898][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:26:13,556][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:26:14,214][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:26:14,872][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:26:15,532][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:26:16,190][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:26:16,848][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:26:17,506][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:26:18,488][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:26:19,148][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:26:19,806][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:26:20,465][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:26:21,122][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:26:21,783][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:26:22,442][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:26:23,099][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:26:23,759][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:26:24,423][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:26:25,084][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:26:25,742][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:26:26,400][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:26:27,060][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:26:27,717][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:26:28,376][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:26:29,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:26:29,834][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:26:31,343][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:26:31,346][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:26:31,348][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:26:32,907][__main__][INFO] - Iteration 85 took 53s (10.82% Gen, 86.25% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 29m 29s. Estimated total time: 14h 49m 56s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 59s, 500 more iterations: 7h 24m 58s. [2026-03-25 15:26:32,910][__main__][INFO] - Starting iteration 85. [2026-03-25 15:26:32,914][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:26:32,914][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:26:39,582][__main__][INFO] - Number of regex retries in iteration 85: 0 [2026-03-25 15:26:39,583][__main__][INFO] - agents played in iteration 85 are Bob, Alice [2026-03-25 15:26:40,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:26:40,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:26:40,137][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:26:40,138][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:26:40,835][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:26:41,449][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:26:42,109][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:26:42,769][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:26:43,428][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:26:44,088][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:26:44,746][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:26:45,407][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:26:46,065][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:26:46,724][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:26:47,383][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:26:48,042][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:26:48,703][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:26:49,361][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:26:50,019][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:26:50,677][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:26:51,340][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:26:51,999][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:26:52,658][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:26:53,317][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:26:53,977][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:26:54,635][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:26:55,294][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:26:55,953][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:26:56,612][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:26:57,272][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:26:57,930][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:26:58,589][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:26:59,249][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:26:59,908][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:27:00,567][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:27:01,226][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:27:01,884][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:27:02,543][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:27:03,202][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:27:03,861][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:27:04,520][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:27:05,179][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:27:05,837][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:27:06,496][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:27:07,155][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:27:07,813][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:27:08,472][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:27:09,131][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:27:09,790][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:27:10,448][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:27:11,107][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:27:11,766][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:27:12,742][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:27:13,403][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:27:14,061][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:27:14,718][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:27:15,377][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:27:16,034][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:27:16,693][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:27:17,353][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:27:18,011][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:27:18,667][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:27:19,326][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:27:19,984][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:27:20,643][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:27:21,302][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:27:21,962][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:27:22,620][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:27:23,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:27:24,045][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:27:25,503][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:27:25,505][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:27:25,507][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:27:27,186][__main__][INFO] - Iteration 86 took 54s (12.29% Gen, 84.62% Train). Generation: 6s, Training: 45s. Estimated remaining time: 13h 43m 11s. Estimated total time: 15h 4m 33s. Time estimates for 10 more iterations: 9m 2s, 100 more iterations: 1h 30m 27s, 500 more iterations: 7h 32m 16s. [2026-03-25 15:27:27,188][__main__][INFO] - Starting iteration 86. [2026-03-25 15:27:27,192][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:27:27,193][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:27:32,651][__main__][INFO] - Number of regex retries in iteration 86: 0 [2026-03-25 15:27:32,652][__main__][INFO] - agents played in iteration 86 are Bob, Alice [2026-03-25 15:27:33,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:27:33,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:27:33,271][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:27:33,271][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:27:33,997][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:27:34,606][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:27:35,264][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:27:35,923][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:27:36,581][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:27:37,238][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:27:37,896][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:27:38,553][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:27:39,211][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:27:39,869][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:27:40,527][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:27:41,184][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:27:41,841][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:27:42,499][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:27:43,157][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:27:43,814][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:27:44,471][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:27:45,128][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:27:45,787][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:27:46,445][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:27:47,103][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:27:47,761][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:27:48,419][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:27:49,078][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:27:49,735][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:27:50,392][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:27:51,050][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:27:51,708][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:27:52,366][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:27:53,023][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:27:53,680][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:27:54,340][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:27:54,997][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:27:55,660][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:27:56,317][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:27:56,976][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:27:57,635][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:27:58,293][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:27:58,953][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:27:59,611][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:28:00,270][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:28:00,929][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:28:01,588][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:28:02,245][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:28:02,903][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:28:03,562][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:28:04,221][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:28:04,878][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:28:05,860][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:28:06,520][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:28:07,178][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:28:07,836][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:28:08,494][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:28:09,153][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:28:09,811][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:28:10,470][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:28:11,129][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:28:11,787][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:28:12,445][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:28:13,104][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:28:13,761][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:28:14,420][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:28:15,077][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:28:15,735][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:28:16,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:28:17,270][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:28:18,690][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:28:18,693][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:28:18,694][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:28:20,174][__main__][INFO] - Iteration 87 took 52s (10.30% Gen, 86.90% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 20m 49s. Estimated total time: 14h 43m 4s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 18s, 500 more iterations: 7h 21m 32s. [2026-03-25 15:28:20,177][__main__][INFO] - Starting iteration 87. [2026-03-25 15:28:20,182][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:28:20,182][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:28:25,099][__main__][INFO] - Number of regex retries in iteration 87: 0 [2026-03-25 15:28:25,100][__main__][INFO] - agents played in iteration 87 are Bob, Alice [2026-03-25 15:28:25,996][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:28:26,058][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:28:26,059][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:28:26,060][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:28:26,736][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:28:27,347][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:28:28,008][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:28:28,667][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:28:29,327][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:28:29,985][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:28:30,644][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:28:31,303][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:28:31,963][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:28:32,622][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:28:33,282][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:28:33,942][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:28:34,601][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:28:35,260][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:28:35,918][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:28:36,577][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:28:37,236][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:28:37,894][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:28:38,552][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:28:39,211][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:28:39,871][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:28:40,530][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:28:41,189][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:28:41,847][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:28:42,505][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:28:43,163][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:28:43,822][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:28:44,481][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:28:45,139][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:28:45,798][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:28:46,457][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:28:47,115][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:28:47,774][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:28:48,432][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:28:49,090][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:28:49,749][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:28:50,408][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:28:51,067][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:28:51,726][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:28:52,384][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:28:53,043][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:28:53,703][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:28:54,362][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:28:55,021][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:28:55,680][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:28:56,339][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:28:56,997][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:28:57,656][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:28:58,646][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:28:59,307][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:28:59,965][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:29:00,623][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:29:01,281][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:29:01,938][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:29:02,597][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:29:03,255][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:29:03,914][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:29:04,571][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:29:05,229][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:29:05,887][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:29:06,544][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:29:07,202][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:29:07,860][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:29:08,517][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:29:09,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:29:09,935][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:29:11,392][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:29:11,395][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:29:11,396][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:29:12,935][__main__][INFO] - Iteration 88 took 52s (9.32% Gen, 87.76% Train). Generation: 4s, Training: 46s. Estimated remaining time: 13h 16m 7s. Estimated total time: 14h 39m 15s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 55s, 500 more iterations: 7h 19m 37s. [2026-03-25 15:29:12,937][__main__][INFO] - Starting iteration 88. [2026-03-25 15:29:12,941][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:29:12,941][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:29:20,469][__main__][INFO] - Number of regex retries in iteration 88: 0 [2026-03-25 15:29:20,470][__main__][INFO] - agents played in iteration 88 are Bob, Alice [2026-03-25 15:29:21,534][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:29:21,596][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:29:21,597][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:29:21,597][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:29:22,405][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:29:23,023][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:29:23,685][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:29:24,342][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:29:25,003][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:29:25,662][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:29:26,321][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:29:26,981][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:29:27,641][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:29:28,300][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:29:28,959][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:29:29,618][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:29:30,276][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:29:30,935][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:29:31,594][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:29:32,252][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:29:32,911][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:29:33,569][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:29:34,228][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:29:34,887][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:29:35,546][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:29:36,206][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:29:36,864][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:29:37,522][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:29:38,180][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:29:38,838][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:29:39,498][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:29:40,157][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:29:40,816][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:29:41,474][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:29:42,133][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:29:42,793][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:29:43,452][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:29:44,110][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:29:44,769][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:29:45,427][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:29:46,085][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:29:46,744][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:29:47,403][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:29:48,062][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:29:48,721][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:29:49,380][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:29:50,038][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:29:50,697][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:29:51,356][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:29:52,015][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:29:52,673][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:29:53,331][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:29:54,309][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:29:54,969][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:29:55,627][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:29:56,286][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:29:56,943][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:29:57,601][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:29:58,259][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:29:58,920][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:29:59,580][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:30:00,236][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:30:00,897][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:30:01,554][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:30:02,211][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:30:02,869][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:30:03,529][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:30:04,188][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:30:04,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:30:05,538][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:30:06,842][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:30:06,844][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:30:06,846][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:30:08,339][__main__][INFO] - Iteration 89 took 55s (13.59% Gen, 83.71% Train). Generation: 7s, Training: 46s. Estimated remaining time: 13h 59m 16s. Estimated total time: 15h 23m 19s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 19s, 500 more iterations: 7h 41m 39s. [2026-03-25 15:30:08,341][__main__][INFO] - Starting iteration 89. [2026-03-25 15:30:08,345][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:30:08,346][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:30:13,250][__main__][INFO] - Number of regex retries in iteration 89: 0 [2026-03-25 15:30:13,251][__main__][INFO] - agents played in iteration 89 are Bob, Alice [2026-03-25 15:30:13,718][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:30:13,780][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:30:13,781][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:30:13,781][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:30:14,434][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:30:15,055][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:30:15,718][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:30:16,377][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:30:17,037][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:30:17,695][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:30:18,353][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:30:19,012][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:30:19,671][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:30:20,330][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:30:20,989][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:30:21,648][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:30:22,306][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:30:22,966][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:30:23,625][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:30:24,283][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:30:24,942][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:30:25,600][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:30:26,259][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:30:26,917][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:30:27,576][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:30:28,234][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:30:28,894][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:30:29,554][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:30:30,213][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:30:30,872][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:30:31,531][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:30:32,191][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:30:32,849][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:30:33,508][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:30:34,167][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:30:34,827][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:30:35,486][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:30:36,144][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:30:36,803][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:30:37,463][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:30:38,122][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:30:38,781][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:30:39,441][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:30:40,101][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:30:40,759][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:30:41,417][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:30:42,076][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:30:42,734][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:30:43,393][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:30:44,052][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:30:44,711][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:30:45,370][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:30:46,346][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:30:47,004][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:30:47,663][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:30:48,321][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:30:48,978][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:30:49,635][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:30:50,294][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:30:50,953][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:30:51,610][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:30:52,268][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:30:52,927][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:30:53,584][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:30:54,241][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:30:54,900][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:30:55,558][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:30:56,219][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:30:56,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:30:57,639][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:30:58,978][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:30:58,981][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:30:58,982][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:31:00,701][__main__][INFO] - Iteration 90 took 52s (9.37% Gen, 87.34% Train). Generation: 4s, Training: 45s. Estimated remaining time: 13h 7m 42s. Estimated total time: 14h 32m 38s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 15s, 500 more iterations: 7h 16m 19s. [2026-03-25 15:31:00,704][__main__][INFO] - Starting iteration 90. [2026-03-25 15:31:00,708][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:31:00,708][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:31:09,050][__main__][INFO] - Number of regex retries in iteration 90: 0 [2026-03-25 15:31:09,051][__main__][INFO] - agents played in iteration 90 are Bob, Alice [2026-03-25 15:31:09,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:31:09,677][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:31:09,677][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:31:09,678][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:31:10,346][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:31:10,961][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:31:11,622][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:31:12,282][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:31:12,941][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:31:13,600][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:31:14,259][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:31:14,919][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:31:15,578][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:31:16,238][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:31:16,897][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:31:17,555][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:31:18,213][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:31:18,872][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:31:19,530][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:31:20,189][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:31:20,847][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:31:21,507][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:31:22,167][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:31:22,825][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:31:23,484][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:31:24,142][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:31:24,801][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:31:25,460][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:31:26,118][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:31:26,777][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:31:27,435][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:31:28,095][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:31:28,754][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:31:29,413][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:31:30,072][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:31:30,731][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:31:31,389][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:31:32,048][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:31:32,706][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:31:33,364][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:31:34,024][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:31:34,683][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:31:35,342][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:31:36,001][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:31:36,660][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:31:37,319][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:31:37,978][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:31:38,638][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:31:39,296][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:31:39,956][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:31:40,615][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:31:41,275][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:31:42,247][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:31:42,909][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:31:43,568][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:31:44,229][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:31:44,887][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:31:45,546][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:31:46,207][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:31:46,866][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:31:47,525][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:31:48,185][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:31:48,844][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:31:49,503][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:31:50,161][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:31:50,820][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:31:51,478][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:31:52,138][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:31:52,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:31:53,565][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:31:55,027][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:31:55,030][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:31:55,031][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:31:56,858][__main__][INFO] - Iteration 91 took 56s (14.86% Gen, 81.89% Train). Generation: 8s, Training: 45s. Estimated remaining time: 14h 10m 0s. Estimated total time: 15h 35m 52s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 35s, 500 more iterations: 7h 47m 56s. [2026-03-25 15:31:56,860][__main__][INFO] - Starting iteration 91. [2026-03-25 15:31:56,864][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:31:56,865][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:32:01,875][__main__][INFO] - Number of regex retries in iteration 91: 0 [2026-03-25 15:32:01,877][__main__][INFO] - agents played in iteration 91 are Bob, Alice [2026-03-25 15:32:02,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:32:02,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:32:02,406][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:32:02,407][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:32:03,091][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:32:03,702][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:32:04,363][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:32:05,022][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:32:05,683][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:32:06,343][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:32:07,002][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:32:07,661][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:32:08,320][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:32:08,978][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:32:09,636][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:32:10,295][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:32:10,953][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:32:11,615][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:32:12,274][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:32:12,933][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:32:13,593][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:32:14,251][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:32:14,909][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:32:15,568][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:32:16,227][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:32:16,885][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:32:17,545][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:32:18,205][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:32:18,864][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:32:19,524][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:32:20,183][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:32:20,842][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:32:21,502][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:32:22,162][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:32:22,820][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:32:23,480][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:32:24,139][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:32:24,798][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:32:25,457][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:32:26,116][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:32:26,777][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:32:27,436][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:32:28,096][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:32:28,756][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:32:29,415][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:32:30,074][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:32:30,733][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:32:31,394][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:32:32,054][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:32:32,713][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:32:33,373][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:32:34,032][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:32:35,009][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:32:35,668][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:32:36,327][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:32:36,985][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:32:37,643][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:32:38,301][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:32:38,961][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:32:39,620][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:32:40,278][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:32:40,936][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:32:41,595][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:32:42,253][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:32:42,911][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:32:43,568][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:32:44,228][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:32:44,886][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:32:45,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:32:46,307][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:32:47,704][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:32:47,706][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:32:47,707][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:32:49,192][__main__][INFO] - Iteration 92 took 52s (9.58% Gen, 87.58% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 5m 25s. Estimated total time: 14h 32m 9s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 12s, 500 more iterations: 7h 16m 4s. [2026-03-25 15:32:49,194][__main__][INFO] - Starting iteration 92. [2026-03-25 15:32:49,199][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:32:49,199][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:32:54,979][__main__][INFO] - Number of regex retries in iteration 92: 0 [2026-03-25 15:32:54,980][__main__][INFO] - agents played in iteration 92 are Bob, Alice [2026-03-25 15:32:55,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:32:55,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:32:55,881][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:32:55,881][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:32:56,703][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:32:57,327][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:32:57,987][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:32:58,647][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:32:59,307][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:32:59,968][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:33:00,627][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:33:01,288][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:33:01,946][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:33:02,606][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:33:03,264][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:33:03,922][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:33:04,582][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:33:05,241][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:33:05,899][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:33:06,557][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:33:07,216][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:33:07,875][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:33:08,533][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:33:09,191][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:33:09,851][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:33:10,511][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:33:11,169][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:33:11,827][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:33:12,486][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:33:13,145][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:33:13,805][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:33:14,463][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:33:15,122][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:33:15,780][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:33:16,441][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:33:17,100][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:33:17,759][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:33:18,419][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:33:19,078][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:33:19,737][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:33:20,395][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:33:21,055][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:33:21,715][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:33:22,374][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:33:23,033][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:33:23,691][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:33:24,349][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:33:25,008][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:33:25,666][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:33:26,330][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:33:26,987][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:33:27,645][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:33:28,628][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:33:29,288][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:33:29,945][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:33:30,605][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:33:31,263][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:33:31,921][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:33:32,581][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:33:33,241][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:33:33,901][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:33:34,560][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:33:35,218][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:33:35,877][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:33:36,535][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:33:37,193][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:33:37,850][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:33:38,509][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:33:39,167][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:33:40,068][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:33:41,417][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:33:41,420][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:33:41,421][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:33:42,966][__main__][INFO] - Iteration 93 took 53s (10.75% Gen, 86.37% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 28m 31s. Estimated total time: 14h 56m 8s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 36s, 500 more iterations: 7h 28m 4s. [2026-03-25 15:33:42,968][__main__][INFO] - Starting iteration 93. [2026-03-25 15:33:42,972][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:33:42,973][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:33:48,595][__main__][INFO] - Number of regex retries in iteration 93: 0 [2026-03-25 15:33:48,596][__main__][INFO] - agents played in iteration 93 are Bob, Alice [2026-03-25 15:33:49,088][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:33:49,150][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:33:49,151][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:33:49,151][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:33:49,803][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:33:50,425][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:33:51,087][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:33:51,747][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:33:52,406][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:33:53,066][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:33:53,727][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:33:54,385][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:33:55,044][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:33:55,702][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:33:56,361][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:33:57,021][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:33:57,681][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:33:58,340][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:33:59,000][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:33:59,659][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:34:00,318][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:34:00,977][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:34:01,636][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:34:02,295][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:34:02,954][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:34:03,613][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:34:04,271][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:34:04,929][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:34:05,588][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:34:06,247][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:34:06,906][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:34:07,565][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:34:08,223][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:34:08,882][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:34:09,542][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:34:10,201][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:34:10,860][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:34:11,519][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:34:12,178][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:34:12,836][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:34:13,496][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:34:14,154][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:34:14,813][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:34:15,472][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:34:16,131][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:34:16,790][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:34:17,450][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:34:18,109][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:34:18,769][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:34:19,428][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:34:20,087][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:34:20,747][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:34:21,723][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:34:22,381][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:34:23,039][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:34:23,697][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:34:24,357][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:34:25,014][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:34:25,674][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:34:26,332][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:34:26,991][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:34:27,650][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:34:28,307][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:34:28,966][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:34:29,626][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:34:30,285][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:34:30,943][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:34:31,602][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:34:32,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:34:32,939][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:34:34,667][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:34:34,669][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:34:34,670][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:34:36,177][__main__][INFO] - Iteration 94 took 53s (10.57% Gen, 86.60% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 18m 15s. Estimated total time: 14h 46m 46s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 40s, 500 more iterations: 7h 23m 23s. [2026-03-25 15:34:36,179][__main__][INFO] - Starting iteration 94. [2026-03-25 15:34:36,183][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:34:36,184][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:34:41,282][__main__][INFO] - Number of regex retries in iteration 94: 0 [2026-03-25 15:34:41,283][__main__][INFO] - agents played in iteration 94 are Bob, Alice [2026-03-25 15:34:41,831][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:34:41,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:34:41,892][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:34:41,892][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:34:42,526][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:34:43,146][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:34:43,807][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:34:44,466][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:34:45,124][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:34:45,783][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:34:46,441][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:34:47,102][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:34:47,761][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:34:48,420][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:34:49,078][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:34:49,737][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:34:50,395][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:34:51,054][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:34:51,712][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:34:52,370][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:34:53,029][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:34:53,688][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:34:54,346][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:34:55,005][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:34:55,666][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:34:56,326][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:34:56,985][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:34:57,646][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:34:58,305][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:34:58,965][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:34:59,625][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:35:00,283][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:35:00,944][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:35:01,604][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:35:02,263][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:35:02,923][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:35:03,585][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:35:04,244][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:35:04,905][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:35:05,565][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:35:06,225][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:35:06,883][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:35:07,542][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:35:08,201][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:35:08,860][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:35:09,520][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:35:10,180][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:35:10,841][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:35:11,499][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:35:12,158][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:35:12,817][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:35:13,478][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:35:14,452][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:35:15,111][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:35:15,769][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:35:16,426][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:35:17,084][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:35:17,743][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:35:18,403][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:35:19,060][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:35:19,719][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:35:20,378][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:35:21,036][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:35:21,696][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:35:22,356][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:35:23,014][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:35:23,672][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:35:24,330][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:35:24,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:35:25,731][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:35:27,063][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:35:27,065][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:35:27,067][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:35:28,643][__main__][INFO] - Iteration 95 took 52s (9.72% Gen, 87.27% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 4m 57s. Estimated total time: 14h 34m 21s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 26s, 500 more iterations: 7h 17m 10s. [2026-03-25 15:35:28,645][__main__][INFO] - Starting iteration 95. [2026-03-25 15:35:28,649][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:35:28,650][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:35:33,728][__main__][INFO] - Number of regex retries in iteration 95: 0 [2026-03-25 15:35:33,729][__main__][INFO] - agents played in iteration 95 are Bob, Alice [2026-03-25 15:35:34,186][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:35:34,247][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:35:34,248][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:35:34,248][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:35:35,003][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:35:35,607][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:35:36,267][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:35:36,925][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:35:37,583][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:35:38,244][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:35:38,901][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:35:39,559][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:35:40,219][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:35:40,879][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:35:41,536][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:35:42,195][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:35:42,855][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:35:43,515][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:35:44,174][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:35:44,832][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:35:45,492][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:35:46,150][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:35:46,811][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:35:47,471][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:35:48,133][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:35:48,790][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:35:49,449][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:35:50,109][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:35:50,768][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:35:51,427][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:35:52,088][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:35:52,747][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:35:53,406][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:35:54,066][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:35:54,725][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:35:55,383][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:35:56,042][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:35:56,701][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:35:57,360][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:35:58,019][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:35:58,680][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:35:59,338][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:35:59,996][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:36:00,656][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:36:01,315][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:36:01,974][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:36:02,632][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:36:03,292][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:36:03,952][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:36:04,610][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:36:05,269][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:36:05,929][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:36:06,909][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:36:07,568][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:36:08,227][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:36:08,886][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:36:09,545][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:36:10,204][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:36:10,863][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:36:11,522][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:36:12,181][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:36:12,840][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:36:13,498][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:36:14,157][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:36:14,815][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:36:15,474][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:36:16,131][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:36:16,789][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:36:17,447][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:36:18,204][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:36:19,620][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:36:19,622][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:36:19,624][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:36:21,062][__main__][INFO] - Iteration 96 took 52s (9.69% Gen, 87.56% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 3m 19s. Estimated total time: 14h 33m 34s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 21s, 500 more iterations: 7h 16m 47s. [2026-03-25 15:36:21,064][__main__][INFO] - Starting iteration 96. [2026-03-25 15:36:21,068][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:36:21,068][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:36:27,812][__main__][INFO] - Number of regex retries in iteration 96: 0 [2026-03-25 15:36:27,813][__main__][INFO] - agents played in iteration 96 are Bob, Alice [2026-03-25 15:36:28,370][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:36:28,431][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:36:28,432][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:36:28,432][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:36:29,285][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:36:29,902][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:36:30,560][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:36:31,219][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:36:31,878][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:36:32,537][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:36:33,195][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:36:33,855][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:36:34,515][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:36:35,172][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:36:35,831][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:36:36,489][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:36:37,148][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:36:37,807][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:36:38,465][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:36:39,123][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:36:39,781][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:36:40,439][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:36:41,097][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:36:41,755][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:36:42,416][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:36:43,074][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:36:43,732][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:36:44,390][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:36:45,047][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:36:45,706][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:36:46,366][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:36:47,024][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:36:47,682][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:36:48,339][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:36:48,997][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:36:49,655][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:36:50,314][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:36:50,972][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:36:51,629][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:36:52,291][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:36:52,950][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:36:53,609][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:36:54,267][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:36:54,927][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:36:55,585][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:36:56,244][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:36:56,903][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:36:57,562][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:36:58,220][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:36:58,879][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:36:59,538][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:37:00,197][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:37:01,173][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:37:01,833][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:37:02,492][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:37:03,149][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:37:03,807][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:37:04,467][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:37:05,125][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:37:05,783][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:37:06,442][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:37:07,100][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:37:07,758][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:37:08,417][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:37:09,077][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:37:09,736][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:37:10,393][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:37:11,051][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:37:11,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:37:12,504][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:37:13,868][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:37:13,871][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:37:13,873][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:37:15,430][__main__][INFO] - Iteration 97 took 54s (12.41% Gen, 84.72% Train). Generation: 6s, Training: 46s. Estimated remaining time: 13h 34m 54s. Estimated total time: 15h 6m 4s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 36s, 500 more iterations: 7h 33m 2s. [2026-03-25 15:37:15,432][__main__][INFO] - Starting iteration 97. [2026-03-25 15:37:15,436][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:37:15,437][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:37:21,089][__main__][INFO] - Number of regex retries in iteration 97: 0 [2026-03-25 15:37:21,090][__main__][INFO] - agents played in iteration 97 are Bob, Alice [2026-03-25 15:37:21,953][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:37:22,015][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:37:22,015][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:37:22,016][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:37:22,702][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:37:23,325][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:37:23,983][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:37:24,643][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:37:25,303][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:37:25,962][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:37:26,623][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:37:27,282][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:37:27,942][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:37:28,601][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:37:29,260][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:37:29,919][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:37:30,579][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:37:31,239][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:37:31,899][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:37:32,559][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:37:33,218][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:37:33,877][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:37:34,536][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:37:35,195][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:37:35,854][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:37:36,513][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:37:37,172][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:37:37,830][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:37:38,489][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:37:39,148][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:37:39,807][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:37:40,467][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:37:41,126][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:37:41,786][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:37:42,445][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:37:43,105][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:37:43,763][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:37:44,422][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:37:45,081][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:37:45,739][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:37:46,398][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:37:47,057][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:37:47,717][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:37:48,376][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:37:49,035][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:37:49,695][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:37:50,353][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:37:51,013][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:37:51,671][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:37:52,331][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:37:52,990][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:37:53,649][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:37:54,628][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:37:55,288][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:37:55,947][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:37:56,605][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:37:57,264][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:37:57,922][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:37:58,581][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:37:59,239][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:37:59,898][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:38:00,555][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:38:01,214][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:38:01,872][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:38:02,531][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:38:03,189][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:38:03,849][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:38:04,506][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:38:05,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:38:05,936][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:38:07,252][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:38:07,255][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:38:07,256][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:38:08,740][__main__][INFO] - Iteration 98 took 53s (10.60% Gen, 86.61% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 16m 22s. Estimated total time: 14h 48m 25s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 50s, 500 more iterations: 7h 24m 12s. [2026-03-25 15:38:08,742][__main__][INFO] - Starting iteration 98. [2026-03-25 15:38:08,746][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:38:08,747][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:38:13,594][__main__][INFO] - Number of regex retries in iteration 98: 0 [2026-03-25 15:38:13,595][__main__][INFO] - agents played in iteration 98 are Bob, Alice [2026-03-25 15:38:14,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:38:14,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:38:14,213][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:38:14,215][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:38:14,988][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:38:15,628][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:38:16,288][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:38:16,948][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:38:17,607][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:38:18,269][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:38:18,928][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:38:19,588][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:38:20,246][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:38:20,907][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:38:21,566][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:38:22,227][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:38:22,886][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:38:23,545][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:38:24,204][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:38:24,863][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:38:25,522][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:38:26,181][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:38:26,840][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:38:27,499][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:38:28,160][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:38:28,820][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:38:29,479][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:38:30,137][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:38:30,796][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:38:31,457][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:38:32,115][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:38:32,774][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:38:33,434][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:38:34,093][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:38:34,752][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:38:35,412][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:38:36,070][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:38:36,729][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:38:37,388][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:38:38,046][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:38:38,705][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:38:39,364][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:38:40,025][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:38:40,682][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:38:41,342][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:38:42,002][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:38:42,662][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:38:43,321][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:38:43,981][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:38:44,640][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:38:45,299][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:38:45,959][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:38:46,934][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:38:47,594][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:38:48,252][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:38:48,909][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:38:49,568][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:38:50,228][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:38:50,886][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:38:51,544][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:38:52,202][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:38:52,861][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:38:53,519][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:38:54,179][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:38:54,837][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:38:55,495][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:38:56,153][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:38:56,813][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:38:57,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:38:58,244][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:39:00,112][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:39:00,115][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:39:00,116][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:39:01,545][__main__][INFO] - Iteration 99 took 52s (9.18% Gen, 88.11% Train). Generation: 4s, Training: 46s. Estimated remaining time: 13h 7m 4s. Estimated total time: 14h 40m 0s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 0s, 500 more iterations: 7h 20m 0s. [2026-03-25 15:39:01,547][__main__][INFO] - Starting iteration 99. [2026-03-25 15:39:01,552][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:39:01,552][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:39:06,491][__main__][INFO] - Number of regex retries in iteration 99: 0 [2026-03-25 15:39:06,493][__main__][INFO] - agents played in iteration 99 are Bob, Alice [2026-03-25 15:39:07,055][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:39:07,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:39:07,117][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:39:07,117][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:39:07,795][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:39:08,465][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:39:09,077][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:39:09,737][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:39:10,397][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:39:11,057][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:39:11,719][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:39:12,379][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:39:13,039][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:39:13,701][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:39:14,360][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:39:15,019][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:39:15,680][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:39:16,339][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:39:16,998][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:39:17,658][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:39:18,318][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:39:18,977][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:39:19,638][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:39:20,299][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:39:20,957][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:39:21,617][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:39:22,279][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:39:22,938][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:39:23,599][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:39:24,259][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:39:24,919][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:39:25,578][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:39:26,240][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:39:26,898][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:39:27,557][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:39:28,217][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:39:28,878][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:39:29,538][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:39:30,197][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:39:30,856][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:39:31,515][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:39:32,176][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:39:32,836][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:39:33,495][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:39:34,154][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:39:34,813][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:39:35,473][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:39:36,132][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:39:36,790][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:39:37,451][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:39:38,111][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:39:38,770][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:39:39,754][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:39:40,414][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:39:41,075][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:39:41,734][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:39:42,392][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:39:43,050][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:39:43,710][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:39:44,368][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:39:45,027][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:39:45,684][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:39:46,348][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:39:47,008][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:39:47,667][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:39:48,326][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:39:48,984][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:39:49,642][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:39:50,300][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:39:51,751][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:39:53,631][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:39:53,634][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:39:53,635][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:39:55,035][__main__][INFO] - Iteration 100 took 53s (9.24% Gen, 88.14% Train). Generation: 4s, Training: 47s. Estimated remaining time: 13h 17m 36s. Estimated total time: 14h 51m 25s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 8s, 500 more iterations: 7h 25m 42s. [2026-03-25 15:39:55,037][__main__][INFO] - Starting iteration 100. [2026-03-25 15:39:55,042][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:39:55,042][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:40:00,125][__main__][INFO] - Number of regex retries in iteration 100: 0 [2026-03-25 15:40:00,127][__main__][INFO] - agents played in iteration 100 are Bob, Alice [2026-03-25 15:40:00,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:40:00,708][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:40:00,708][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:40:00,709][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:40:01,526][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:40:02,145][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:40:02,805][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:40:03,465][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:40:04,125][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:40:04,783][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:40:05,444][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:40:06,105][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:40:06,764][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:40:07,422][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:40:08,083][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:40:08,742][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:40:09,401][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:40:10,061][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:40:10,720][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:40:11,380][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:40:12,039][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:40:12,698][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:40:13,359][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:40:14,018][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:40:14,677][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:40:15,340][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:40:16,000][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:40:16,660][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:40:17,320][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:40:17,979][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:40:18,640][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:40:19,299][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:40:19,958][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:40:20,618][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:40:21,279][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:40:21,938][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:40:22,597][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:40:23,257][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:40:23,917][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:40:24,575][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:40:25,235][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:40:25,896][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:40:26,555][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:40:27,214][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:40:27,873][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:40:28,534][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:40:29,193][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:40:29,852][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:40:30,512][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:40:31,172][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:40:31,831][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:40:32,491][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:40:33,473][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:40:34,135][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:40:34,794][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:40:35,453][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:40:36,114][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:40:36,773][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:40:37,431][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:40:38,089][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:40:38,747][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:40:39,406][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:40:40,064][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:40:40,723][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:40:41,382][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:40:42,041][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:40:42,700][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:40:43,358][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:40:44,017][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:40:44,957][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:40:46,277][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:40:46,280][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:40:46,281][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:40:49,526][__main__][INFO] - Iteration 101 took 54s (9.33% Gen, 84.71% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 33m 21s. Estimated total time: 15h 8m 6s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 48s, 500 more iterations: 7h 34m 3s. [2026-03-25 15:40:49,528][__main__][INFO] - Starting iteration 101. [2026-03-25 15:40:49,532][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:40:49,532][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:40:55,334][__main__][INFO] - Number of regex retries in iteration 101: 0 [2026-03-25 15:40:55,335][__main__][INFO] - agents played in iteration 101 are Bob, Alice [2026-03-25 15:40:56,388][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:40:56,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:40:56,451][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:40:56,451][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:40:57,267][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:40:57,913][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:40:58,552][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:40:59,213][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:40:59,871][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:41:00,529][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:41:01,188][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:41:01,850][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:41:02,509][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:41:03,170][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:41:03,832][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:41:04,492][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:41:05,151][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:41:05,809][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:41:06,468][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:41:07,126][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:41:07,783][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:41:08,441][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:41:09,099][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:41:09,757][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:41:10,415][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:41:11,072][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:41:11,731][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:41:12,390][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:41:13,048][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:41:13,706][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:41:14,364][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:41:15,024][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:41:15,682][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:41:16,340][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:41:16,998][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:41:17,656][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:41:18,314][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:41:18,972][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:41:19,630][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:41:20,288][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:41:20,945][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:41:21,604][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:41:22,261][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:41:22,919][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:41:23,578][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:41:24,236][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:41:24,893][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:41:25,551][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:41:26,209][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:41:26,868][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:41:27,525][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:41:28,183][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:41:29,163][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:41:29,823][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:41:30,481][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:41:31,140][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:41:31,798][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:41:32,456][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:41:33,114][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:41:33,772][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:41:34,429][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:41:35,088][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:41:35,745][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:41:36,403][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:41:37,061][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:41:37,719][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:41:38,377][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:41:39,035][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:41:39,693][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:41:40,546][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:41:42,057][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:41:42,060][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:41:42,061][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:41:43,648][__main__][INFO] - Iteration 102 took 54s (10.72% Gen, 86.34% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 26m 19s. Estimated total time: 15h 1m 57s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 11s, 500 more iterations: 7h 30m 58s. [2026-03-25 15:41:43,650][__main__][INFO] - Starting iteration 102. [2026-03-25 15:41:43,654][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:41:43,654][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:41:53,443][__main__][INFO] - Number of regex retries in iteration 102: 0 [2026-03-25 15:41:53,444][__main__][INFO] - agents played in iteration 102 are Bob, Alice [2026-03-25 15:41:54,528][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:41:54,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:41:54,589][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:41:54,590][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:41:55,479][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:41:56,093][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:41:56,754][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:41:57,414][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:41:58,073][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:41:58,733][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:41:59,393][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:42:00,052][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:42:00,713][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:42:01,371][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:42:02,030][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:42:02,689][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:42:03,348][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:42:04,008][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:42:04,667][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:42:05,325][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:42:05,984][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:42:06,643][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:42:07,302][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:42:07,962][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:42:08,621][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:42:09,279][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:42:09,938][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:42:10,597][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:42:11,256][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:42:11,915][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:42:12,575][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:42:13,234][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:42:13,893][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:42:14,551][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:42:15,210][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:42:15,869][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:42:16,528][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:42:17,188][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:42:17,846][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:42:18,505][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:42:19,164][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:42:19,823][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:42:20,482][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:42:21,140][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:42:21,799][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:42:22,458][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:42:23,116][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:42:23,775][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:42:24,434][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:42:25,093][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:42:25,752][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:42:26,410][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:42:27,424][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:42:28,083][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:42:28,741][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:42:29,400][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:42:30,058][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:42:30,717][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:42:31,377][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:42:32,036][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:42:32,694][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:42:33,353][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:42:34,011][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:42:34,670][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:42:35,328][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:42:35,986][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:42:36,645][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:42:37,304][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:42:37,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:42:38,739][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:42:40,033][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:42:40,036][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:42:40,037][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:42:41,515][__main__][INFO] - Iteration 103 took 57s (16.92% Gen, 80.52% Train). Generation: 9s, Training: 46s. Estimated remaining time: 14h 27m 47s. Estimated total time: 16h 4m 23s. Time estimates for 10 more iterations: 9m 38s, 100 more iterations: 1h 36m 26s, 500 more iterations: 8h 2m 11s. [2026-03-25 15:42:41,518][__main__][INFO] - Starting iteration 103. [2026-03-25 15:42:41,522][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:42:41,523][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:42:46,770][__main__][INFO] - Number of regex retries in iteration 103: 0 [2026-03-25 15:42:46,770][__main__][INFO] - agents played in iteration 103 are Bob, Alice [2026-03-25 15:42:47,351][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:42:47,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:42:47,413][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:42:47,413][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:42:48,302][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:42:48,922][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:42:49,584][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:42:50,242][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:42:50,904][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:42:51,562][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:42:52,221][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:42:52,880][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:42:53,540][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:42:54,198][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:42:54,855][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:42:55,513][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:42:56,171][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:42:56,832][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:42:57,489][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:42:58,147][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:42:58,806][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:42:59,464][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:43:00,122][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:43:00,780][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:43:01,438][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:43:02,095][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:43:02,754][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:43:03,412][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:43:04,071][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:43:04,729][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:43:05,388][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:43:06,047][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:43:06,705][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:43:07,364][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:43:08,021][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:43:08,679][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:43:09,336][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:43:09,994][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:43:10,653][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:43:11,310][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:43:11,969][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:43:12,627][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:43:13,286][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:43:13,944][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:43:14,602][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:43:15,263][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:43:15,921][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:43:16,579][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:43:17,238][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:43:17,895][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:43:18,553][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:43:19,211][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:43:20,196][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:43:20,855][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:43:21,515][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:43:22,174][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:43:22,832][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:43:23,493][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:43:24,153][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:43:24,812][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:43:25,470][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:43:26,131][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:43:26,791][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:43:27,448][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:43:28,107][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:43:28,766][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:43:29,426][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:43:30,084][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:43:30,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:43:31,452][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:43:32,796][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:43:32,798][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:43:32,800][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:43:34,605][__main__][INFO] - Iteration 104 took 53s (9.89% Gen, 86.71% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 7m 15s. Estimated total time: 14h 44m 44s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 28s, 500 more iterations: 7h 22m 22s. [2026-03-25 15:43:34,607][__main__][INFO] - Starting iteration 104. [2026-03-25 15:43:34,610][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:43:34,611][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:43:39,944][__main__][INFO] - Number of regex retries in iteration 104: 0 [2026-03-25 15:43:39,945][__main__][INFO] - agents played in iteration 104 are Bob, Alice [2026-03-25 15:43:40,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:43:40,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:43:40,472][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:43:40,472][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:43:41,291][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:43:41,912][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:43:42,575][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:43:43,234][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:43:43,894][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:43:44,553][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:43:45,212][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:43:45,874][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:43:46,533][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:43:47,192][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:43:47,854][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:43:48,512][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:43:49,172][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:43:49,831][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:43:50,492][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:43:51,152][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:43:51,811][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:43:52,471][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:43:53,130][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:43:53,791][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:43:54,450][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:43:55,109][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:43:55,769][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:43:56,427][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:43:57,086][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:43:57,746][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:43:58,406][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:43:59,066][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:43:59,727][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:44:00,388][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:44:01,048][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:44:01,708][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:44:02,367][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:44:03,025][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:44:03,685][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:44:04,345][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:44:05,003][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:44:05,662][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:44:06,321][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:44:06,981][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:44:07,640][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:44:08,299][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:44:08,958][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:44:09,618][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:44:10,279][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:44:10,939][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:44:11,599][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:44:12,263][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:44:13,249][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:44:13,909][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:44:14,568][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:44:15,229][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:44:15,887][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:44:16,546][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:44:17,205][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:44:17,865][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:44:18,523][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:44:19,181][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:44:19,841][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:44:20,501][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:44:21,160][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:44:21,819][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:44:22,478][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:44:23,136][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:44:23,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:44:24,627][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:44:26,120][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:44:26,123][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:44:26,124][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:44:27,765][__main__][INFO] - Iteration 105 took 53s (10.04% Gen, 86.87% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 7m 33s. Estimated total time: 14h 45m 55s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 35s, 500 more iterations: 7h 22m 57s. [2026-03-25 15:44:27,767][__main__][INFO] - Starting iteration 105. [2026-03-25 15:44:27,770][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:44:27,771][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:44:32,670][__main__][INFO] - Number of regex retries in iteration 105: 0 [2026-03-25 15:44:32,671][__main__][INFO] - agents played in iteration 105 are Bob, Alice [2026-03-25 15:44:33,245][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:44:33,307][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:44:33,307][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:44:33,308][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:44:33,989][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:44:34,610][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:44:35,270][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:44:35,930][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:44:36,590][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:44:37,249][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:44:37,908][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:44:38,568][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:44:39,228][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:44:39,888][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:44:40,548][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:44:41,207][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:44:41,868][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:44:42,529][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:44:43,188][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:44:43,849][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:44:44,508][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:44:45,168][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:44:45,828][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:44:46,486][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:44:47,147][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:44:47,807][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:44:48,467][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:44:49,127][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:44:49,786][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:44:50,445][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:44:51,106][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:44:51,767][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:44:52,426][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:44:53,084][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:44:53,743][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:44:54,403][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:44:55,062][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:44:55,723][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:44:56,384][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:44:57,044][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:44:57,704][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:44:58,363][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:44:59,023][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:44:59,682][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:45:00,342][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:45:01,001][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:45:01,660][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:45:02,319][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:45:02,978][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:45:03,638][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:45:04,298][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:45:04,956][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:45:05,942][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:45:06,601][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:45:07,259][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:45:07,920][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:45:08,581][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:45:09,240][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:45:09,898][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:45:10,557][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:45:11,217][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:45:11,878][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:45:12,535][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:45:13,194][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:45:13,853][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:45:14,512][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:45:15,171][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:45:15,831][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:45:16,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:45:17,325][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:45:18,676][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:45:18,679][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:45:18,680][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:45:20,157][__main__][INFO] - Iteration 106 took 52s (9.35% Gen, 87.82% Train). Generation: 4s, Training: 46s. Estimated remaining time: 12h 53m 54s. Estimated total time: 14h 33m 8s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 18s, 500 more iterations: 7h 16m 34s. [2026-03-25 15:45:20,160][__main__][INFO] - Starting iteration 106. [2026-03-25 15:45:20,166][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:45:20,166][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:45:25,270][__main__][INFO] - Number of regex retries in iteration 106: 0 [2026-03-25 15:45:25,271][__main__][INFO] - agents played in iteration 106 are Bob, Alice [2026-03-25 15:45:25,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:45:25,837][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:45:25,837][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:45:25,838][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:45:26,567][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:45:27,173][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:45:27,833][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:45:28,495][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:45:29,155][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:45:29,816][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:45:30,477][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:45:31,135][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:45:31,795][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:45:32,454][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:45:33,116][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:45:33,775][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:45:34,435][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:45:35,095][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:45:35,755][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:45:36,414][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:45:37,074][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:45:37,735][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:45:38,395][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:45:39,054][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:45:39,715][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:45:40,377][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:45:41,036][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:45:41,693][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:45:42,352][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:45:43,011][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:45:43,671][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:45:44,330][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:45:44,989][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:45:45,648][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:45:46,307][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:45:46,966][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:45:47,625][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:45:48,284][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:45:48,943][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:45:49,601][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:45:50,260][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:45:50,920][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:45:51,579][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:45:52,237][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:45:52,896][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:45:53,554][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:45:54,214][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:45:54,873][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:45:55,532][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:45:56,191][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:45:56,851][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:45:57,509][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:45:58,489][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:45:59,148][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:45:59,806][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:46:00,463][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:46:01,123][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:46:01,782][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:46:02,444][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:46:03,103][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:46:03,762][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:46:04,419][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:46:05,078][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:46:05,736][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:46:06,395][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:46:07,054][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:46:07,713][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:46:08,371][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:46:09,030][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:46:09,795][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:46:11,352][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:46:11,355][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:46:11,356][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:46:12,758][__main__][INFO] - Iteration 107 took 52s (9.71% Gen, 87.62% Train). Generation: 5s, Training: 46s. Estimated remaining time: 12h 56m 28s. Estimated total time: 14h 36m 36s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 39s, 500 more iterations: 7h 18m 18s. [2026-03-25 15:46:12,760][__main__][INFO] - Starting iteration 107. [2026-03-25 15:46:12,764][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:46:12,765][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:46:19,620][__main__][INFO] - Number of regex retries in iteration 107: 0 [2026-03-25 15:46:19,621][__main__][INFO] - agents played in iteration 107 are Bob, Alice [2026-03-25 15:46:20,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:46:20,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:46:20,749][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:46:20,750][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:46:21,460][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:46:22,075][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:46:22,735][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:46:23,396][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:46:24,057][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:46:24,718][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:46:25,379][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:46:26,038][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:46:26,697][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:46:27,356][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:46:28,014][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:46:28,674][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:46:29,333][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:46:29,991][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:46:30,649][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:46:31,308][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:46:31,967][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:46:32,625][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:46:33,284][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:46:33,943][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:46:34,601][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:46:35,259][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:46:35,918][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:46:36,576][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:46:37,236][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:46:37,895][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:46:38,554][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:46:39,213][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:46:39,872][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:46:40,531][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:46:41,189][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:46:41,848][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:46:42,511][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:46:43,170][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:46:43,829][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:46:44,487][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:46:45,146][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:46:45,805][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:46:46,464][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:46:47,122][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:46:47,781][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:46:48,439][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:46:49,098][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:46:49,757][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:46:50,415][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:46:51,074][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:46:51,733][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:46:52,391][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:46:53,372][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:46:54,030][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:46:54,689][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:46:55,347][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:46:56,006][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:46:56,664][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:46:57,321][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:46:57,979][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:46:58,637][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:46:59,294][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:46:59,952][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:47:00,611][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:47:01,269][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:47:01,928][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:47:02,585][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:47:03,243][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:47:03,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:47:04,663][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:47:06,466][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:47:06,468][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:47:06,469][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:47:07,995][__main__][INFO] - Iteration 108 took 55s (12.41% Gen, 84.82% Train). Generation: 6s, Training: 46s. Estimated remaining time: 13h 39m 30s. Estimated total time: 15h 20m 32s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 3s, 500 more iterations: 7h 40m 16s. [2026-03-25 15:47:07,998][__main__][INFO] - Starting iteration 108. [2026-03-25 15:47:08,003][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:47:08,003][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:47:13,764][__main__][INFO] - Number of regex retries in iteration 108: 0 [2026-03-25 15:47:13,766][__main__][INFO] - agents played in iteration 108 are Bob, Alice [2026-03-25 15:47:14,788][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:47:14,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:47:14,850][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:47:14,851][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:47:15,681][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:47:16,304][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:47:16,966][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:47:17,626][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:47:18,285][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:47:18,943][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:47:19,603][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:47:20,264][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:47:20,924][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:47:21,584][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:47:22,242][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:47:22,901][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:47:23,560][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:47:24,218][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:47:24,878][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:47:25,536][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:47:26,194][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:47:26,853][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:47:27,512][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:47:28,171][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:47:28,830][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:47:29,490][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:47:30,150][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:47:30,812][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:47:31,472][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:47:32,130][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:47:32,789][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:47:33,448][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:47:34,107][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:47:34,765][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:47:35,423][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:47:36,083][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:47:36,742][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:47:37,400][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:47:38,058][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:47:38,717][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:47:39,376][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:47:40,034][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:47:40,693][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:47:41,352][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:47:42,011][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:47:42,669][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:47:43,328][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:47:43,988][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:47:44,647][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:47:45,306][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:47:45,965][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:47:46,624][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:47:47,606][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:47:48,266][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:47:48,925][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:47:49,583][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:47:50,241][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:47:50,900][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:47:51,562][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:47:52,220][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:47:52,881][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:47:53,540][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:47:54,200][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:47:54,860][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:47:55,519][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:47:56,178][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:47:56,837][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:47:57,496][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:47:58,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:47:58,945][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:48:00,338][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:48:00,340][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:48:00,342][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:48:01,866][__main__][INFO] - Iteration 109 took 53s (10.70% Gen, 86.47% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 15m 49s. Estimated total time: 14h 57m 45s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 46s, 500 more iterations: 7h 28m 52s. [2026-03-25 15:48:01,869][__main__][INFO] - Starting iteration 109. [2026-03-25 15:48:01,874][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:48:01,875][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:48:07,705][__main__][INFO] - Number of regex retries in iteration 109: 0 [2026-03-25 15:48:07,706][__main__][INFO] - agents played in iteration 109 are Bob, Alice [2026-03-25 15:48:08,229][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:48:08,293][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:48:08,294][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:48:08,294][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:48:08,999][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:48:09,620][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:48:10,280][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:48:10,941][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:48:11,601][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:48:12,261][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:48:12,921][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:48:13,582][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:48:14,243][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:48:14,903][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:48:15,563][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:48:16,221][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:48:16,882][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:48:17,540][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:48:18,199][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:48:18,858][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:48:19,517][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:48:20,176][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:48:20,836][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:48:21,495][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:48:22,154][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:48:22,813][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:48:23,475][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:48:24,134][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:48:24,792][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:48:25,451][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:48:26,110][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:48:26,769][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:48:27,429][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:48:28,088][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:48:28,747][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:48:29,407][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:48:30,068][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:48:30,727][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:48:31,387][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:48:32,046][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:48:32,704][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:48:33,364][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:48:34,022][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:48:34,681][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:48:35,339][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:48:35,999][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:48:36,658][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:48:37,317][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:48:37,977][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:48:38,636][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:48:39,295][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:48:39,955][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:48:40,935][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:48:41,594][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:48:42,252][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:48:42,913][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:48:43,574][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:48:44,234][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:48:44,894][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:48:45,557][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:48:46,217][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:48:46,876][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:48:47,535][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:48:48,194][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:48:48,854][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:48:49,512][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:48:50,170][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:48:50,830][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:48:51,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:48:52,330][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:48:53,705][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:48:53,707][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:48:53,709][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:48:55,255][__main__][INFO] - Iteration 110 took 53s (10.92% Gen, 86.17% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 6m 52s. Estimated total time: 14h 49m 42s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 58s, 500 more iterations: 7h 24m 51s. [2026-03-25 15:48:55,257][__main__][INFO] - Starting iteration 110. [2026-03-25 15:48:55,261][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:48:55,262][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:49:03,437][__main__][INFO] - Number of regex retries in iteration 110: 0 [2026-03-25 15:49:03,439][__main__][INFO] - agents played in iteration 110 are Bob, Alice [2026-03-25 15:49:03,898][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:49:03,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:49:03,960][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:49:03,961][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:49:04,655][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:49:05,277][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:49:05,934][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:49:06,592][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:49:07,250][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:49:07,908][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:49:08,566][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:49:09,227][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:49:09,885][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:49:10,543][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:49:11,202][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:49:11,860][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:49:12,518][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:49:13,176][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:49:13,834][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:49:14,493][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:49:15,151][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:49:15,809][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:49:16,466][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:49:17,124][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:49:17,783][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:49:18,441][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:49:19,099][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:49:19,757][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:49:20,414][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:49:21,072][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:49:21,730][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:49:22,388][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:49:23,045][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:49:23,703][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:49:24,362][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:49:25,021][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:49:25,680][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:49:26,338][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:49:26,996][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:49:27,653][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:49:28,312][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:49:28,970][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:49:29,627][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:49:30,284][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:49:30,942][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:49:31,600][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:49:32,258][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:49:32,916][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:49:33,574][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:49:34,232][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:49:34,891][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:49:35,549][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:49:36,547][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:49:37,206][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:49:37,864][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:49:38,523][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:49:39,182][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:49:39,840][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:49:40,499][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:49:41,157][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:49:41,817][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:49:42,477][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:49:43,135][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:49:43,794][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:49:44,453][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:49:45,112][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:49:45,771][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:49:46,429][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:49:47,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:49:47,866][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:49:49,207][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:49:49,210][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:49:49,211][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:49:50,960][__main__][INFO] - Iteration 111 took 55s (14.68% Gen, 82.17% Train). Generation: 8s, Training: 45s. Estimated remaining time: 13h 44m 35s. Estimated total time: 15h 28m 21s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 50s, 500 more iterations: 7h 44m 10s. [2026-03-25 15:49:50,962][__main__][INFO] - Starting iteration 111. [2026-03-25 15:49:50,967][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:49:50,967][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:49:56,749][__main__][INFO] - Number of regex retries in iteration 111: 0 [2026-03-25 15:49:56,750][__main__][INFO] - agents played in iteration 111 are Bob, Alice [2026-03-25 15:49:57,319][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:49:57,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:49:57,381][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:49:57,382][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:49:58,033][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:49:58,681][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:49:59,342][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:50:00,004][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:50:00,664][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:50:01,325][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:50:01,983][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:50:02,641][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:50:03,299][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:50:03,957][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:50:04,615][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:50:05,275][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:50:05,932][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:50:06,590][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:50:07,250][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:50:07,907][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:50:08,566][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:50:09,225][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:50:09,888][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:50:10,547][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:50:11,207][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:50:11,865][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:50:12,526][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:50:13,185][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:50:13,844][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:50:14,502][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:50:15,161][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:50:15,821][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:50:16,479][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:50:17,137][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:50:17,796][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:50:18,455][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:50:19,119][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:50:19,778][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:50:20,436][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:50:21,094][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:50:21,753][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:50:22,410][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:50:23,069][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:50:23,727][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:50:24,384][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:50:25,042][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:50:25,700][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:50:26,358][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:50:27,016][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:50:27,674][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:50:28,333][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:50:28,990][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:50:29,977][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:50:30,635][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:50:31,294][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:50:31,951][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:50:32,610][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:50:33,267][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:50:33,926][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:50:34,587][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:50:35,245][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:50:35,904][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:50:36,562][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:50:37,221][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:50:37,879][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:50:38,537][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:50:39,195][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:50:39,854][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:50:40,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:50:41,253][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:50:42,609][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:50:42,612][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:50:42,613][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:50:44,060][__main__][INFO] - Iteration 112 took 53s (10.89% Gen, 86.38% Train). Generation: 5s, Training: 45s. Estimated remaining time: 13h 0m 16s. Estimated total time: 14h 44m 55s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 29s, 500 more iterations: 7h 22m 27s. [2026-03-25 15:50:44,062][__main__][INFO] - Starting iteration 112. [2026-03-25 15:50:44,066][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:50:44,067][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:50:48,912][__main__][INFO] - Number of regex retries in iteration 112: 0 [2026-03-25 15:50:48,913][__main__][INFO] - agents played in iteration 112 are Bob, Alice [2026-03-25 15:50:49,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:50:49,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:50:49,429][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:50:49,429][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:50:50,131][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:50:50,752][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:50:51,411][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:50:52,070][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:50:52,729][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:50:53,389][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:50:54,048][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:50:54,707][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:50:55,368][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:50:56,028][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:50:56,687][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:50:57,346][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:50:58,008][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:50:58,668][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:50:59,327][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:50:59,987][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:51:00,646][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:51:01,305][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:51:01,965][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:51:02,625][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:51:03,284][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:51:03,943][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:51:04,602][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:51:05,261][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:51:05,921][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:51:06,580][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:51:07,239][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:51:07,898][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:51:08,557][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:51:09,217][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:51:09,877][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:51:10,536][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:51:11,196][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:51:11,857][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:51:12,517][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:51:13,176][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:51:13,835][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:51:14,494][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:51:15,153][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:51:15,812][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:51:16,471][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:51:17,130][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:51:17,790][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:51:18,449][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:51:19,110][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:51:19,770][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:51:20,430][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:51:21,091][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:51:22,072][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:51:22,732][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:51:23,393][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:51:24,051][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:51:24,710][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:51:25,368][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:51:26,026][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:51:26,685][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:51:27,343][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:51:28,001][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:51:28,659][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:51:29,318][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:51:29,976][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:51:30,634][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:51:31,294][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:51:31,952][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:51:32,612][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:51:33,369][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:51:34,695][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:51:34,697][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:51:34,699][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:51:36,196][__main__][INFO] - Iteration 113 took 52s (9.30% Gen, 87.83% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 43m 21s. Estimated total time: 14h 28m 52s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 53s, 500 more iterations: 7h 14m 26s. [2026-03-25 15:51:36,199][__main__][INFO] - Starting iteration 113. [2026-03-25 15:51:36,203][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:51:36,204][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:51:41,463][__main__][INFO] - Number of regex retries in iteration 113: 0 [2026-03-25 15:51:41,464][__main__][INFO] - agents played in iteration 113 are Bob, Alice [2026-03-25 15:51:42,526][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:51:42,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:51:42,590][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:51:42,591][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:51:43,306][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:51:43,924][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:51:44,585][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:51:45,245][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:51:45,906][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:51:46,568][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:51:47,228][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:51:47,890][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:51:48,550][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:51:49,210][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:51:49,869][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:51:50,530][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:51:51,190][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:51:51,849][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:51:52,508][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:51:53,167][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:51:53,826][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:51:54,485][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:51:55,145][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:51:55,803][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:51:56,462][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:51:57,120][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:51:57,778][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:51:58,438][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:51:59,097][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:51:59,756][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:52:00,415][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:52:01,074][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:52:01,733][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:52:02,392][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:52:03,053][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:52:03,712][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:52:04,371][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:52:05,029][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:52:05,687][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:52:06,345][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:52:07,004][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:52:07,663][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:52:08,321][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:52:08,980][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:52:09,639][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:52:10,299][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:52:10,958][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:52:11,616][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:52:12,276][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:52:12,935][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:52:13,595][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:52:14,254][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:52:15,237][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:52:15,900][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:52:16,558][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:52:17,217][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:52:17,877][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:52:18,536][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:52:19,197][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:52:19,855][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:52:20,516][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:52:21,174][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:52:21,835][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:52:22,493][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:52:23,155][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:52:23,814][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:52:24,472][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:52:25,131][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:52:25,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:52:26,500][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:52:27,929][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:52:27,932][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:52:27,933][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:52:29,748][__main__][INFO] - Iteration 114 took 53s (9.82% Gen, 86.78% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 6m 2s. Estimated total time: 14h 52m 26s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 14s, 500 more iterations: 7h 26m 13s. [2026-03-25 15:52:29,750][__main__][INFO] - Starting iteration 114. [2026-03-25 15:52:29,754][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:52:29,755][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:52:35,488][__main__][INFO] - Number of regex retries in iteration 114: 0 [2026-03-25 15:52:35,489][__main__][INFO] - agents played in iteration 114 are Bob, Alice [2026-03-25 15:52:36,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:52:36,388][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:52:36,389][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:52:36,389][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:52:37,190][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:52:37,821][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:52:38,473][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:52:39,133][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:52:39,793][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:52:40,453][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:52:41,113][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:52:41,772][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:52:42,432][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:52:43,093][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:52:43,753][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:52:44,412][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:52:45,071][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:52:45,731][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:52:46,391][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:52:47,051][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:52:47,712][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:52:48,371][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:52:49,030][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:52:49,688][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:52:50,348][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:52:51,007][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:52:51,667][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:52:52,326][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:52:52,987][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:52:53,646][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:52:54,306][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:52:54,968][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:52:55,627][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:52:56,286][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:52:56,947][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:52:57,603][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:52:58,262][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:52:58,922][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:52:59,581][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:53:00,240][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:53:00,899][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:53:01,558][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:53:02,217][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:53:02,876][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:53:03,535][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:53:04,196][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:53:04,855][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:53:05,515][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:53:06,172][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:53:06,833][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:53:07,492][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:53:08,151][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:53:09,164][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:53:09,822][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:53:10,481][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:53:11,139][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:53:11,798][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:53:12,456][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:53:13,115][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:53:13,775][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:53:14,434][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:53:15,092][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:53:15,751][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:53:16,410][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:53:17,070][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:53:17,729][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:53:18,390][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:53:19,050][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:53:19,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:53:20,569][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:53:21,977][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:53:21,980][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:53:21,982][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:53:23,431][__main__][INFO] - Iteration 115 took 53s (10.68% Gen, 86.61% Train). Generation: 5s, Training: 46s. Estimated remaining time: 13h 7m 20s. Estimated total time: 14h 54m 38s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 27s, 500 more iterations: 7h 27m 19s. [2026-03-25 15:53:23,434][__main__][INFO] - Starting iteration 115. [2026-03-25 15:53:23,438][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:53:23,439][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:53:33,290][__main__][INFO] - Number of regex retries in iteration 115: 0 [2026-03-25 15:53:33,291][__main__][INFO] - agents played in iteration 115 are Bob, Alice [2026-03-25 15:53:34,162][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:53:34,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:53:34,224][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:53:34,225][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:53:35,091][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:53:35,777][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:53:36,438][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:53:37,096][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:53:37,754][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:53:38,411][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:53:39,074][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:53:39,729][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:53:40,387][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:53:41,046][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:53:41,704][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:53:42,366][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:53:43,024][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:53:43,683][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:53:44,341][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:53:45,000][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:53:45,658][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:53:46,316][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:53:46,974][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:53:47,632][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:53:48,290][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:53:48,949][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:53:49,606][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:53:50,264][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:53:50,921][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:53:51,579][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:53:52,238][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:53:52,898][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:53:53,554][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:53:54,212][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:53:54,870][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:53:55,529][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:53:56,187][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:53:56,845][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:53:57,503][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:53:58,160][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:53:58,818][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:53:59,478][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:54:00,136][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:54:00,793][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:54:01,451][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:54:02,114][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:54:02,767][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:54:03,425][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:54:04,084][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:54:04,742][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:54:05,400][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:54:06,058][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:54:07,043][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:54:07,701][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:54:08,359][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:54:09,017][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:54:09,675][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:54:10,334][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:54:10,992][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:54:11,650][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:54:12,309][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:54:12,967][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:54:13,625][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:54:14,283][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:54:14,943][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:54:15,601][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:54:16,260][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:54:16,917][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:54:17,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:54:18,529][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:54:19,981][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:54:19,984][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:54:19,985][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:54:22,025][__main__][INFO] - Iteration 116 took 58s (16.82% Gen, 79.70% Train). Generation: 9s, Training: 46s. Estimated remaining time: 14h 28m 11s. Estimated total time: 16h 16m 28s. Time estimates for 10 more iterations: 9m 45s, 100 more iterations: 1h 37m 38s, 500 more iterations: 8h 8m 14s. [2026-03-25 15:54:22,027][__main__][INFO] - Starting iteration 116. [2026-03-25 15:54:22,031][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:54:22,032][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:54:31,033][__main__][INFO] - Number of regex retries in iteration 116: 0 [2026-03-25 15:54:31,035][__main__][INFO] - agents played in iteration 116 are Bob, Alice [2026-03-25 15:54:31,602][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:54:31,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:54:31,664][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:54:31,665][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:54:32,410][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:54:33,020][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:54:33,678][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:54:34,337][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:54:34,993][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:54:35,650][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:54:36,308][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:54:36,965][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:54:37,623][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:54:38,280][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:54:38,938][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:54:39,597][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:54:40,255][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:54:40,911][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:54:41,569][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:54:42,226][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:54:42,884][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:54:43,542][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:54:44,199][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:54:44,858][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:54:45,515][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:54:46,173][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:54:46,831][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:54:47,489][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:54:48,146][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:54:48,803][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:54:49,462][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:54:50,119][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:54:50,776][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:54:51,435][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:54:52,095][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:54:52,750][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:54:53,408][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:54:54,065][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:54:54,722][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:54:55,380][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:54:56,037][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:54:56,695][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:54:57,353][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:54:58,010][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:54:58,668][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:54:59,325][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:54:59,985][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:55:00,641][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:55:01,299][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:55:01,956][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:55:02,613][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:55:03,270][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:55:04,257][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:55:04,916][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:55:05,573][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:55:06,231][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:55:06,888][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:55:07,547][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:55:08,204][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:55:08,862][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:55:09,519][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:55:10,177][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:55:10,836][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:55:11,493][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:55:12,153][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:55:12,810][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:55:13,470][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:55:14,125][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:55:14,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:55:15,572][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:55:16,921][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:55:16,924][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:55:16,926][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:55:18,365][__main__][INFO] - Iteration 117 took 56s (15.98% Gen, 81.46% Train). Generation: 9s, Training: 45s. Estimated remaining time: 13h 49m 43s. Estimated total time: 15h 38m 56s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 53s, 500 more iterations: 7h 49m 28s. [2026-03-25 15:55:18,368][__main__][INFO] - Starting iteration 117. [2026-03-25 15:55:18,373][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:55:18,373][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:55:25,561][__main__][INFO] - Number of regex retries in iteration 117: 0 [2026-03-25 15:55:25,563][__main__][INFO] - agents played in iteration 117 are Bob, Alice [2026-03-25 15:55:26,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:55:26,139][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:55:26,139][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:55:26,140][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:55:26,904][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:55:27,515][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:55:28,174][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:55:28,831][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:55:29,487][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:55:30,144][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:55:30,801][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:55:31,458][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:55:32,115][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:55:32,775][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:55:33,432][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:55:34,091][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:55:34,748][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:55:35,407][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:55:36,068][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:55:36,727][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:55:37,384][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:55:38,040][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:55:38,698][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:55:39,355][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:55:40,012][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:55:40,671][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:55:41,329][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:55:41,988][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:55:42,645][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:55:43,304][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:55:43,964][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:55:44,620][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:55:45,280][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:55:45,936][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:55:46,593][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:55:47,251][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:55:47,910][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:55:48,568][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:55:49,225][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:55:49,884][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:55:50,541][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:55:51,199][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:55:51,857][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:55:52,515][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:55:53,172][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:55:53,831][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:55:54,491][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:55:55,146][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:55:55,803][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:55:56,461][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:55:57,119][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:55:57,776][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:55:58,762][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:55:59,420][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:56:00,077][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:56:00,734][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:56:01,392][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:56:02,050][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:56:02,707][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:56:03,365][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:56:04,022][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:56:04,680][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:56:05,340][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:56:05,997][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:56:06,656][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:56:07,313][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:56:07,971][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:56:08,629][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:56:09,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:56:10,079][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:56:11,626][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:56:11,629][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:56:11,631][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:56:13,241][__main__][INFO] - Iteration 118 took 54s (13.10% Gen, 83.96% Train). Generation: 7s, Training: 46s. Estimated remaining time: 13h 24m 22s. Estimated total time: 15h 14m 30s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 15s. [2026-03-25 15:56:13,243][__main__][INFO] - Starting iteration 118. [2026-03-25 15:56:13,247][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:56:13,248][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:56:18,398][__main__][INFO] - Number of regex retries in iteration 118: 0 [2026-03-25 15:56:18,399][__main__][INFO] - agents played in iteration 118 are Bob, Alice [2026-03-25 15:56:19,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:56:19,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:56:19,516][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:56:19,516][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:56:20,222][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:56:20,845][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:56:21,504][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:56:22,160][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:56:22,818][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:56:23,476][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:56:24,134][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:56:24,792][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:56:25,450][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:56:26,110][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:56:26,766][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:56:27,424][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:56:28,082][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:56:28,740][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:56:29,399][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:56:30,056][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:56:30,716][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:56:31,371][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:56:32,028][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:56:32,688][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:56:33,346][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:56:34,003][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:56:34,661][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:56:35,318][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:56:35,978][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:56:36,634][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:56:37,292][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:56:37,949][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:56:38,606][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:56:39,263][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:56:39,923][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:56:40,579][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:56:41,237][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:56:41,895][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:56:42,553][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:56:43,210][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:56:43,867][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:56:44,527][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:56:45,185][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:56:45,843][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:56:46,500][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:56:47,159][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:56:47,817][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:56:48,474][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:56:49,133][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:56:49,791][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:56:50,448][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:56:51,106][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:56:52,100][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:56:52,755][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:56:53,413][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:56:54,071][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:56:54,728][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:56:55,386][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:56:56,045][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:56:56,702][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:56:57,360][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:56:58,020][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:56:58,675][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:56:59,336][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:56:59,997][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:57:00,653][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:57:01,310][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:57:01,968][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:57:02,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:57:03,402][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:57:04,715][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:57:04,719][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:57:04,720][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:57:06,184][__main__][INFO] - Iteration 119 took 52s (9.73% Gen, 87.50% Train). Generation: 5s, Training: 46s. Estimated remaining time: 12h 51m 17s. Estimated total time: 14h 42m 18s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 13s, 500 more iterations: 7h 21m 9s. [2026-03-25 15:57:06,187][__main__][INFO] - Starting iteration 119. [2026-03-25 15:57:06,192][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:57:06,192][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:57:14,030][__main__][INFO] - Number of regex retries in iteration 119: 0 [2026-03-25 15:57:14,032][__main__][INFO] - agents played in iteration 119 are Bob, Alice [2026-03-25 15:57:14,855][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:57:14,916][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:57:14,917][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:57:14,918][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:57:15,567][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:57:16,187][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:57:16,846][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:57:17,503][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:57:18,160][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:57:18,818][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:57:19,476][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:57:20,134][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:57:20,790][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:57:21,448][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:57:22,107][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:57:22,762][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:57:23,419][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:57:24,077][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:57:24,735][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:57:25,393][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:57:26,051][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:57:26,709][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:57:27,367][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:57:28,024][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:57:28,684][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:57:29,340][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:57:29,997][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:57:30,654][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:57:31,311][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:57:31,970][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:57:32,627][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:57:33,285][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:57:33,944][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:57:34,601][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:57:35,259][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:57:35,916][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:57:36,573][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:57:37,231][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:57:37,888][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:57:38,547][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:57:39,204][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:57:39,863][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:57:40,521][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:57:41,178][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:57:41,835][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:57:42,493][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:57:43,150][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:57:43,808][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:57:44,466][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:57:45,123][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:57:45,782][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:57:46,439][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:57:47,424][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:57:48,082][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:57:48,742][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:57:49,399][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:57:50,057][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:57:50,715][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:57:51,374][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:57:52,033][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:57:52,689][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:57:53,346][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:57:54,004][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:57:54,664][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:57:55,319][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:57:55,981][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:57:56,637][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:57:57,294][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:57:57,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:57:58,871][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:58:00,199][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:58:00,202][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:58:00,203][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:58:01,911][__main__][INFO] - Iteration 120 took 55s (14.07% Gen, 82.86% Train). Generation: 7s, Training: 46s. Estimated remaining time: 13h 36m 43s. Estimated total time: 15h 28m 40s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 52s, 500 more iterations: 7h 44m 20s. [2026-03-25 15:58:01,912][__main__][INFO] - Starting iteration 120. [2026-03-25 15:58:01,916][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:58:01,917][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:58:07,573][__main__][INFO] - Number of regex retries in iteration 120: 0 [2026-03-25 15:58:07,574][__main__][INFO] - agents played in iteration 120 are Bob, Alice [2026-03-25 15:58:08,134][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:58:08,195][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:58:08,196][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:58:08,196][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:58:08,971][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:58:09,574][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:58:10,232][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:58:10,890][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:58:11,546][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:58:12,203][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:58:12,860][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:58:13,518][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:58:14,174][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:58:14,832][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:58:15,491][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:58:16,148][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:58:16,805][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:58:17,462][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:58:18,119][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:58:18,775][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:58:19,432][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:58:20,090][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:58:20,747][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:58:21,404][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:58:22,061][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:58:22,719][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:58:23,376][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:58:24,033][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:58:24,690][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:58:25,347][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:58:26,004][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:58:26,661][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:58:27,318][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:58:27,977][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:58:28,632][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:58:29,289][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:58:29,946][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:58:30,603][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:58:31,263][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:58:31,919][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:58:32,577][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:58:33,235][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:58:33,892][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:58:34,549][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:58:35,208][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:58:35,869][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:58:36,523][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:58:37,182][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:58:37,838][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:58:38,495][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:58:39,153][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:58:39,812][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:58:40,820][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:58:41,479][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:58:42,138][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:58:42,797][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:58:43,455][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:58:44,114][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:58:44,772][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:58:45,431][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:58:46,088][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:58:46,746][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:58:47,404][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:58:48,063][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:58:48,719][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:58:49,379][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:58:50,036][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:58:50,693][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:58:51,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:58:52,133][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:58:53,445][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:58:53,447][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:58:53,448][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:58:55,180][__main__][INFO] - Iteration 121 took 53s (10.62% Gen, 86.12% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 54m 55s. Estimated total time: 14h 47m 45s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 46s, 500 more iterations: 7h 23m 52s. [2026-03-25 15:58:55,182][__main__][INFO] - Starting iteration 121. [2026-03-25 15:58:55,186][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:58:55,186][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:59:00,057][__main__][INFO] - Number of regex retries in iteration 121: 0 [2026-03-25 15:59:00,058][__main__][INFO] - agents played in iteration 121 are Bob, Alice [2026-03-25 15:59:00,528][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:59:00,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:59:00,591][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:59:00,591][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:59:01,368][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:59:01,997][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:59:02,655][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:59:03,312][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:59:03,970][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:59:04,631][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:59:05,290][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:59:05,947][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:59:06,607][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:59:07,266][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:59:07,925][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:59:08,583][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:59:09,241][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:59:09,900][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:59:10,560][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:59:11,220][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:59:11,876][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:59:12,535][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:59:13,194][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:59:13,853][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:59:14,511][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:59:15,169][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:59:15,829][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:59:16,487][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:59:17,145][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:59:17,803][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:59:18,462][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:59:19,120][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:59:19,778][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:59:20,437][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:59:21,095][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:59:21,753][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:59:22,411][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:59:23,070][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:59:23,728][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:59:24,387][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:59:25,046][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:59:25,704][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:59:26,363][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:59:27,022][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:59:27,681][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:59:28,341][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:59:28,999][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:59:29,657][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:59:30,316][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:59:30,975][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:59:31,633][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:59:32,292][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:59:33,269][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:59:33,926][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:59:34,583][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:59:35,240][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:59:35,897][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:59:36,555][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:59:37,215][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:59:37,872][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:59:38,530][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:59:39,188][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:59:39,848][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:59:40,505][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:59:41,163][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:59:41,823][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:59:42,478][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:59:43,136][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:59:43,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:59:44,673][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 15:59:45,999][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:59:46,002][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:59:46,003][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:59:48,041][__main__][INFO] - Iteration 122 took 52s (9.22% Gen, 86.92% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 47m 14s. Estimated total time: 14h 40m 57s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 5s, 500 more iterations: 7h 20m 28s. [2026-03-25 15:59:48,044][__main__][INFO] - Starting iteration 122. [2026-03-25 15:59:48,048][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 15:59:48,049][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:59:55,346][__main__][INFO] - Number of regex retries in iteration 122: 0 [2026-03-25 15:59:55,347][__main__][INFO] - agents played in iteration 122 are Bob, Alice [2026-03-25 15:59:55,914][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:59:55,976][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 15:59:55,977][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:59:55,978][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:59:56,771][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:59:57,384][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:59:58,041][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:59:58,700][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:59:59,356][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:00:00,012][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:00:00,670][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:00:01,327][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:00:01,984][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:00:02,641][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:00:03,298][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:00:03,955][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:00:04,612][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:00:05,270][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:00:05,927][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:00:06,584][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:00:07,242][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:00:07,898][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:00:08,556][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:00:09,213][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:00:09,871][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:00:10,528][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:00:11,186][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:00:11,843][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:00:12,501][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:00:13,159][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:00:13,820][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:00:14,477][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:00:15,136][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:00:15,794][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:00:16,451][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:00:17,109][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:00:17,767][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:00:18,424][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:00:19,081][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:00:19,738][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:00:20,395][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:00:21,052][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:00:21,710][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:00:22,367][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:00:23,027][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:00:23,682][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:00:24,339][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:00:24,997][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:00:25,656][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:00:26,314][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:00:26,971][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:00:27,629][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:00:28,614][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:00:29,272][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:00:29,930][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:00:30,588][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:00:31,247][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:00:31,905][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:00:32,565][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:00:33,223][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:00:33,880][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:00:34,537][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:00:35,194][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:00:35,852][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:00:36,510][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:00:37,168][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:00:37,825][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:00:38,483][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:00:39,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:00:40,057][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:00:41,832][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:00:41,835][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:00:41,836][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:00:43,252][__main__][INFO] - Iteration 123 took 55s (13.22% Gen, 84.21% Train). Generation: 7s, Training: 46s. Estimated remaining time: 13h 25m 27s. Estimated total time: 15h 20m 5s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 0s, 500 more iterations: 7h 40m 2s. [2026-03-25 16:00:43,254][__main__][INFO] - Starting iteration 123. [2026-03-25 16:00:43,259][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:00:43,260][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:00:49,309][__main__][INFO] - Number of regex retries in iteration 123: 0 [2026-03-25 16:00:49,310][__main__][INFO] - agents played in iteration 123 are Bob, Alice [2026-03-25 16:00:49,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:00:49,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:00:49,853][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:00:49,853][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:00:50,665][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:00:51,276][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:00:51,936][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:00:52,594][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:00:53,252][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:00:53,909][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:00:54,569][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:00:55,227][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:00:55,884][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:00:56,542][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:00:57,200][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:00:57,858][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:00:58,516][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:00:59,173][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:00:59,832][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:01:00,490][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:01:01,148][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:01:01,807][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:01:02,465][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:01:03,124][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:01:03,781][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:01:04,439][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:01:05,097][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:01:05,754][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:01:06,411][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:01:07,068][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:01:07,725][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:01:08,382][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:01:09,041][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:01:09,698][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:01:10,359][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:01:11,017][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:01:11,674][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:01:12,331][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:01:12,989][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:01:13,646][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:01:14,304][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:01:14,962][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:01:15,619][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:01:16,277][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:01:16,934][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:01:17,591][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:01:18,248][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:01:18,905][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:01:19,562][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:01:20,220][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:01:20,877][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:01:21,536][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:01:22,521][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:01:23,180][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:01:23,842][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:01:24,498][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:01:25,155][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:01:25,813][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:01:26,471][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:01:27,128][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:01:27,786][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:01:28,445][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:01:29,103][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:01:29,760][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:01:30,418][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:01:31,075][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:01:31,733][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:01:32,390][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:01:33,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:01:33,917][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:01:35,227][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:01:35,230][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:01:35,231][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:01:36,978][__main__][INFO] - Iteration 124 took 53s (11.26% Gen, 85.48% Train). Generation: 6s, Training: 45s. Estimated remaining time: 12h 59m 50s. Estimated total time: 14h 55m 21s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 32s, 500 more iterations: 7h 27m 40s. [2026-03-25 16:01:36,980][__main__][INFO] - Starting iteration 124. [2026-03-25 16:01:36,985][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:01:36,986][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:01:42,353][__main__][INFO] - Number of regex retries in iteration 124: 0 [2026-03-25 16:01:42,354][__main__][INFO] - agents played in iteration 124 are Bob, Alice [2026-03-25 16:01:43,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:01:43,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:01:43,433][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:01:43,434][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:01:44,213][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:01:44,839][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:01:45,499][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:01:46,159][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:01:46,817][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:01:47,476][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:01:48,134][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:01:48,792][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:01:49,451][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:01:50,109][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:01:50,766][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:01:51,424][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:01:52,081][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:01:52,739][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:01:53,397][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:01:54,055][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:01:54,713][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:01:55,371][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:01:56,028][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:01:56,686][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:01:57,343][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:01:58,001][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:01:58,659][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:01:59,316][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:01:59,974][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:02:00,632][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:02:01,290][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:02:01,948][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:02:02,606][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:02:03,263][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:02:03,923][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:02:04,581][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:02:05,238][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:02:05,895][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:02:06,553][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:02:07,212][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:02:07,873][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:02:08,531][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:02:09,190][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:02:09,848][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:02:10,506][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:02:11,165][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:02:11,824][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:02:12,484][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:02:13,143][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:02:13,801][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:02:14,459][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:02:15,116][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:02:16,105][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:02:16,764][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:02:17,424][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:02:18,080][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:02:18,738][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:02:19,397][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:02:20,055][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:02:20,713][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:02:21,370][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:02:22,028][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:02:22,686][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:02:23,344][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:02:24,002][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:02:24,659][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:02:25,317][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:02:25,974][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:02:26,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:02:27,419][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:02:28,750][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:02:28,753][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:02:28,754][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:02:30,518][__main__][INFO] - Iteration 125 took 53s (10.03% Gen, 86.67% Train). Generation: 5s, Training: 46s. Estimated remaining time: 12h 55m 50s. Estimated total time: 14h 52m 15s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 13s, 500 more iterations: 7h 26m 7s. [2026-03-25 16:02:30,521][__main__][INFO] - Starting iteration 125. [2026-03-25 16:02:30,525][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:02:30,526][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:02:36,133][__main__][INFO] - Number of regex retries in iteration 125: 0 [2026-03-25 16:02:36,134][__main__][INFO] - agents played in iteration 125 are Bob, Alice [2026-03-25 16:02:37,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:02:37,169][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:02:37,169][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:02:37,170][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:02:38,016][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:02:38,636][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:02:39,302][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:02:39,959][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:02:40,617][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:02:41,276][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:02:41,936][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:02:42,593][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:02:43,250][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:02:43,908][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:02:44,566][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:02:45,225][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:02:45,882][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:02:46,539][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:02:47,197][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:02:47,855][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:02:48,512][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:02:49,172][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:02:49,830][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:02:50,487][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:02:51,144][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:02:51,802][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:02:52,459][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:02:53,116][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:02:53,774][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:02:54,432][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:02:55,090][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:02:55,750][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:02:56,410][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:02:57,068][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:02:57,726][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:02:58,384][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:02:59,041][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:02:59,698][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:03:00,355][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:03:01,013][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:03:01,670][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:03:02,328][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:03:02,986][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:03:03,643][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:03:04,301][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:03:04,959][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:03:05,617][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:03:06,274][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:03:06,932][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:03:07,591][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:03:08,249][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:03:08,908][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:03:09,898][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:03:10,558][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:03:11,218][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:03:11,876][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:03:12,533][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:03:13,193][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:03:13,851][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:03:14,510][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:03:15,167][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:03:15,824][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:03:16,483][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:03:17,141][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:03:17,800][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:03:18,457][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:03:19,115][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:03:19,774][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:03:20,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:03:21,281][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:03:22,757][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:03:22,760][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:03:22,761][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:03:24,161][__main__][INFO] - Iteration 126 took 53s (10.46% Gen, 86.93% Train). Generation: 5s, Training: 46s. Estimated remaining time: 12h 56m 38s. Estimated total time: 14h 53m 57s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 23s, 500 more iterations: 7h 26m 58s. [2026-03-25 16:03:24,163][__main__][INFO] - Starting iteration 126. [2026-03-25 16:03:24,168][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:03:24,169][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:03:33,614][__main__][INFO] - Number of regex retries in iteration 126: 0 [2026-03-25 16:03:33,616][__main__][INFO] - agents played in iteration 126 are Bob, Alice [2026-03-25 16:03:34,159][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:03:34,221][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:03:34,221][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:03:34,222][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:03:35,045][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:03:35,661][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:03:36,320][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:03:36,979][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:03:37,638][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:03:38,300][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:03:38,956][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:03:39,613][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:03:40,272][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:03:40,934][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:03:41,593][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:03:42,252][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:03:42,909][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:03:43,569][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:03:44,227][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:03:44,886][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:03:45,543][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:03:46,200][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:03:46,858][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:03:47,517][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:03:48,175][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:03:48,836][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:03:49,495][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:03:50,151][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:03:50,808][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:03:51,465][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:03:52,122][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:03:52,779][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:03:53,437][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:03:54,094][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:03:54,751][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:03:55,410][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:03:56,067][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:03:56,724][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:03:57,381][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:03:58,038][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:03:58,696][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:03:59,353][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:04:00,011][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:04:00,667][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:04:01,325][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:04:01,982][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:04:02,639][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:04:03,296][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:04:03,954][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:04:04,611][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:04:05,269][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:04:05,926][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:04:06,921][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:04:07,579][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:04:08,238][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:04:08,899][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:04:09,557][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:04:10,214][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:04:10,872][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:04:11,531][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:04:12,189][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:04:12,850][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:04:13,509][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:04:14,166][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:04:14,825][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:04:15,482][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:04:16,141][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:04:16,798][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:04:17,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:04:18,303][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:04:19,642][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:04:19,645][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:04:19,646][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:04:21,183][__main__][INFO] - Iteration 127 took 57s (16.57% Gen, 80.73% Train). Generation: 9s, Training: 46s. Estimated remaining time: 13h 52m 1s. Estimated total time: 15h 50m 17s. Time estimates for 10 more iterations: 9m 30s, 100 more iterations: 1h 35m 1s, 500 more iterations: 7h 55m 8s. [2026-03-25 16:04:21,186][__main__][INFO] - Starting iteration 127. [2026-03-25 16:04:21,192][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:04:21,192][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:04:26,210][__main__][INFO] - Number of regex retries in iteration 127: 0 [2026-03-25 16:04:26,211][__main__][INFO] - agents played in iteration 127 are Bob, Alice [2026-03-25 16:04:26,867][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:04:26,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:04:26,929][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:04:26,930][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:04:27,624][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:04:28,230][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:04:28,888][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:04:29,546][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:04:30,205][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:04:30,865][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:04:31,523][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:04:32,180][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:04:32,843][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:04:33,497][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:04:34,157][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:04:34,815][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:04:35,473][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:04:36,129][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:04:36,787][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:04:37,443][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:04:38,100][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:04:38,757][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:04:39,414][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:04:40,071][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:04:40,729][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:04:41,386][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:04:42,044][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:04:42,701][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:04:43,360][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:04:44,017][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:04:44,674][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:04:45,331][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:04:45,988][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:04:46,646][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:04:47,303][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:04:47,959][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:04:48,616][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:04:49,274][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:04:49,931][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:04:50,588][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:04:51,247][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:04:51,906][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:04:52,561][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:04:53,220][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:04:53,878][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:04:54,536][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:04:55,194][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:04:55,851][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:04:56,510][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:04:57,167][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:04:57,825][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:04:58,483][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:04:59,472][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:05:00,131][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:05:00,789][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:05:01,449][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:05:02,109][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:05:02,768][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:05:03,428][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:05:04,086][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:05:04,745][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:05:05,403][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:05:06,062][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:05:06,721][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:05:07,380][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:05:08,038][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:05:08,698][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:05:09,356][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:05:10,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:05:10,851][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:05:12,327][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:05:12,331][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:05:12,332][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:05:13,746][__main__][INFO] - Iteration 128 took 52s (9.55% Gen, 87.76% Train). Generation: 5s, Training: 46s. Estimated remaining time: 12h 36m 47s. Estimated total time: 14h 35m 55s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 35s, 500 more iterations: 7h 17m 57s. [2026-03-25 16:05:13,748][__main__][INFO] - Starting iteration 128. [2026-03-25 16:05:13,753][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:05:13,753][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:05:21,018][__main__][INFO] - Number of regex retries in iteration 128: 0 [2026-03-25 16:05:21,019][__main__][INFO] - agents played in iteration 128 are Bob, Alice [2026-03-25 16:05:21,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:05:21,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:05:21,700][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:05:21,700][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:05:22,388][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:05:23,006][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:05:23,664][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:05:24,320][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:05:24,976][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:05:25,635][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:05:26,293][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:05:26,951][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:05:27,607][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:05:28,264][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:05:28,923][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:05:29,582][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:05:30,240][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:05:30,900][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:05:31,557][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:05:32,214][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:05:32,871][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:05:33,527][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:05:34,184][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:05:34,841][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:05:35,499][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:05:36,156][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:05:36,816][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:05:37,474][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:05:38,132][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:05:38,789][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:05:39,447][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:05:40,105][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:05:40,762][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:05:41,419][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:05:42,076][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:05:42,734][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:05:43,392][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:05:44,050][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:05:44,708][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:05:45,365][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:05:46,024][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:05:46,681][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:05:47,338][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:05:47,996][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:05:48,654][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:05:49,311][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:05:49,969][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:05:50,627][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:05:51,284][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:05:51,942][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:05:52,600][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:05:53,258][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:05:54,242][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:05:54,900][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:05:55,558][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:05:56,215][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:05:56,872][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:05:57,530][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:05:58,189][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:05:58,847][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:05:59,505][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:06:00,162][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:06:00,820][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:06:01,478][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:06:02,136][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:06:02,794][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:06:03,451][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:06:04,108][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:06:04,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:06:05,502][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:06:06,920][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:06:06,923][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:06:06,924][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:06:08,350][__main__][INFO] - Iteration 129 took 54s (13.31% Gen, 84.08% Train). Generation: 7s, Training: 45s. Estimated remaining time: 13h 9m 56s. Estimated total time: 15h 9m 59s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 59s, 500 more iterations: 7h 34m 59s. [2026-03-25 16:06:08,353][__main__][INFO] - Starting iteration 129. [2026-03-25 16:06:08,357][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:06:08,357][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:06:13,609][__main__][INFO] - Number of regex retries in iteration 129: 0 [2026-03-25 16:06:13,611][__main__][INFO] - agents played in iteration 129 are Bob, Alice [2026-03-25 16:06:14,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:06:14,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:06:14,137][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:06:14,137][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:06:14,835][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:06:15,488][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:06:16,100][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:06:16,756][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:06:17,413][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:06:18,069][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:06:18,727][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:06:19,386][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:06:20,044][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:06:20,700][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:06:21,359][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:06:22,016][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:06:22,673][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:06:23,330][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:06:23,989][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:06:24,645][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:06:25,305][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:06:25,962][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:06:26,619][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:06:27,277][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:06:27,935][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:06:28,593][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:06:29,251][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:06:29,910][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:06:30,568][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:06:31,224][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:06:31,881][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:06:32,538][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:06:33,194][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:06:33,851][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:06:34,508][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:06:35,166][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:06:35,823][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:06:36,479][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:06:37,137][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:06:37,794][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:06:38,451][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:06:39,109][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:06:39,768][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:06:40,425][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:06:41,082][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:06:41,740][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:06:42,397][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:06:43,054][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:06:43,711][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:06:44,373][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:06:45,028][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:06:45,686][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:06:46,680][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:06:47,338][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:06:47,995][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:06:48,651][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:06:49,308][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:06:49,967][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:06:50,625][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:06:51,286][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:06:51,942][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:06:52,600][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:06:53,258][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:06:53,916][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:06:54,573][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:06:55,232][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:06:55,890][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:06:56,547][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:06:57,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:06:58,051][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:06:59,427][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:06:59,429][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:06:59,431][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:07:00,919][__main__][INFO] - Iteration 130 took 52s (9.99% Gen, 87.17% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 35m 9s. Estimated total time: 14h 36m 4s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 36s, 500 more iterations: 7h 18m 2s. [2026-03-25 16:07:00,921][__main__][INFO] - Starting iteration 130. [2026-03-25 16:07:00,926][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:07:00,926][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:07:05,948][__main__][INFO] - Number of regex retries in iteration 130: 0 [2026-03-25 16:07:05,949][__main__][INFO] - agents played in iteration 130 are Bob, Alice [2026-03-25 16:07:06,553][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:07:06,616][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:07:06,617][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:07:06,617][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:07:07,284][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:07:07,895][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:07:08,553][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:07:09,212][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:07:09,869][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:07:10,528][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:07:11,191][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:07:11,850][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:07:12,508][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:07:13,165][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:07:13,824][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:07:14,482][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:07:15,139][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:07:15,798][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:07:16,456][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:07:17,113][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:07:17,771][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:07:18,428][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:07:19,087][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:07:19,742][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:07:20,400][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:07:21,058][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:07:21,717][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:07:22,374][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:07:23,032][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:07:23,691][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:07:24,349][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:07:25,006][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:07:25,665][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:07:26,323][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:07:26,982][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:07:27,639][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:07:28,297][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:07:28,954][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:07:29,612][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:07:30,270][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:07:30,927][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:07:31,584][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:07:32,243][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:07:32,902][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:07:33,559][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:07:34,217][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:07:34,874][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:07:35,531][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:07:36,189][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:07:36,847][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:07:37,505][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:07:38,162][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:07:39,143][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:07:39,801][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:07:40,460][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:07:41,118][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:07:41,776][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:07:42,434][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:07:43,091][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:07:43,749][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:07:44,406][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:07:45,063][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:07:45,721][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:07:46,378][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:07:47,037][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:07:47,697][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:07:48,356][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:07:49,015][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:07:49,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:07:50,461][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:07:51,987][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:07:51,990][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:07:51,992][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:07:53,480][__main__][INFO] - Iteration 131 took 52s (9.56% Gen, 87.60% Train). Generation: 5s, Training: 46s. Estimated remaining time: 12h 34m 8s. Estimated total time: 14h 35m 56s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 35s, 500 more iterations: 7h 17m 58s. [2026-03-25 16:07:53,483][__main__][INFO] - Starting iteration 131. [2026-03-25 16:07:53,488][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:07:53,489][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:08:07,611][__main__][INFO] - Number of regex retries in iteration 131: 0 [2026-03-25 16:08:07,612][__main__][INFO] - agents played in iteration 131 are Bob, Alice [2026-03-25 16:08:08,195][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:08:08,258][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:08:08,259][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:08:08,259][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:08:09,002][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:08:09,620][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:08:10,284][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:08:10,938][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:08:11,598][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:08:12,256][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:08:12,914][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:08:13,571][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:08:14,230][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:08:14,889][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:08:15,546][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:08:16,205][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:08:16,863][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:08:17,521][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:08:18,181][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:08:18,838][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:08:19,497][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:08:20,154][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:08:20,812][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:08:21,471][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:08:22,129][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:08:22,787][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:08:23,446][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:08:24,105][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:08:24,762][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:08:25,420][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:08:26,078][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:08:26,736][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:08:27,396][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:08:28,055][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:08:28,713][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:08:29,372][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:08:30,030][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:08:30,690][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:08:31,348][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:08:32,007][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:08:32,665][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:08:33,323][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:08:33,981][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:08:34,639][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:08:35,297][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:08:35,955][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:08:36,613][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:08:37,271][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:08:37,929][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:08:38,588][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:08:39,247][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:08:39,905][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:08:40,901][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:08:41,560][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:08:42,218][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:08:42,877][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:08:43,534][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:08:44,191][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:08:44,851][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:08:45,508][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:08:46,166][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:08:46,821][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:08:47,478][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:08:48,137][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:08:48,794][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:08:49,452][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:08:50,113][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:08:50,768][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:08:51,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:08:52,251][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:08:53,686][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:08:53,689][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:08:53,690][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:08:55,174][__main__][INFO] - Iteration 132 took 1m 1s (22.89% Gen, 74.70% Train). Generation: 14s, Training: 46s. Estimated remaining time: 15h 5m 17s. Estimated total time: 17h 8m 7s. Time estimates for 10 more iterations: 10m 16s, 100 more iterations: 1h 42m 48s, 500 more iterations: 8h 34m 3s. [2026-03-25 16:08:55,178][__main__][INFO] - Starting iteration 132. [2026-03-25 16:08:55,184][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:08:55,185][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:09:01,835][__main__][INFO] - Number of regex retries in iteration 132: 0 [2026-03-25 16:09:01,837][__main__][INFO] - agents played in iteration 132 are Bob, Alice [2026-03-25 16:09:02,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:09:02,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:09:02,475][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:09:02,475][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:09:03,174][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:09:03,803][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:09:04,461][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:09:05,118][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:09:05,776][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:09:06,433][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:09:07,092][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:09:07,748][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:09:08,406][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:09:09,063][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:09:09,721][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:09:10,378][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:09:11,036][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:09:11,694][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:09:12,351][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:09:13,008][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:09:13,667][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:09:14,325][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:09:14,982][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:09:15,639][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:09:16,297][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:09:16,954][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:09:17,611][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:09:18,269][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:09:18,928][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:09:19,586][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:09:20,243][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:09:20,900][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:09:21,557][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:09:22,215][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:09:22,872][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:09:23,530][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:09:24,187][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:09:24,844][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:09:25,501][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:09:26,158][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:09:26,817][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:09:27,475][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:09:28,133][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:09:28,791][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:09:29,448][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:09:30,106][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:09:30,763][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:09:31,420][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:09:32,077][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:09:32,735][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:09:33,394][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:09:34,051][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:09:35,041][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:09:35,699][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:09:36,354][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:09:37,011][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:09:37,669][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:09:38,327][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:09:38,984][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:09:39,642][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:09:40,299][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:09:40,959][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:09:41,616][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:09:42,276][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:09:42,933][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:09:43,590][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:09:44,248][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:09:44,906][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:09:45,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:09:46,392][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:09:47,849][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:09:47,852][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:09:47,854][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:09:49,322][__main__][INFO] - Iteration 133 took 54s (12.29% Gen, 85.00% Train). Generation: 6s, Training: 46s. Estimated remaining time: 12h 58m 35s. Estimated total time: 15h 2m 19s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 13s, 500 more iterations: 7h 31m 9s. [2026-03-25 16:09:49,324][__main__][INFO] - Starting iteration 133. [2026-03-25 16:09:49,329][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:09:49,329][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:10:00,333][__main__][INFO] - Number of regex retries in iteration 133: 0 [2026-03-25 16:10:00,334][__main__][INFO] - agents played in iteration 133 are Bob, Alice [2026-03-25 16:10:00,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:10:00,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:10:00,865][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:10:00,865][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:10:01,702][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:10:02,325][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:10:02,983][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:10:03,641][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:10:04,300][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:10:04,959][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:10:05,616][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:10:06,275][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:10:06,932][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:10:07,590][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:10:08,247][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:10:08,907][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:10:09,565][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:10:10,222][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:10:10,882][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:10:11,539][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:10:12,199][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:10:12,855][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:10:13,513][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:10:14,170][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:10:14,828][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:10:15,485][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:10:16,143][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:10:16,800][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:10:17,457][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:10:18,115][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:10:18,773][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:10:19,430][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:10:20,090][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:10:20,747][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:10:21,404][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:10:22,061][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:10:22,718][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:10:23,376][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:10:24,033][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:10:24,691][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:10:25,652][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:10:26,309][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:10:26,967][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:10:27,624][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:10:28,282][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:10:28,939][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:10:29,597][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:10:30,254][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:10:30,911][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:10:31,569][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:10:32,226][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:10:32,883][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:10:33,872][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:10:34,535][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:10:35,191][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:10:35,850][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:10:36,507][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:10:37,165][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:10:37,822][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:10:38,479][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:10:39,138][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:10:39,796][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:10:40,454][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:10:41,111][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:10:41,768][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:10:42,426][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:10:43,084][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:10:43,741][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:10:44,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:10:45,190][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:10:46,649][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:10:46,652][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:10:46,653][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:10:48,139][__main__][INFO] - Iteration 134 took 58s (18.71% Gen, 78.76% Train). Generation: 11s, Training: 46s. Estimated remaining time: 14h 15m 30s. Estimated total time: 16h 20m 12s. Time estimates for 10 more iterations: 9m 48s, 100 more iterations: 1h 38m 1s, 500 more iterations: 8h 10m 6s. [2026-03-25 16:10:48,142][__main__][INFO] - Starting iteration 134. [2026-03-25 16:10:48,145][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:10:48,146][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:10:53,926][__main__][INFO] - Number of regex retries in iteration 134: 0 [2026-03-25 16:10:53,927][__main__][INFO] - agents played in iteration 134 are Bob, Alice [2026-03-25 16:10:54,508][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:10:54,570][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:10:54,570][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:10:54,571][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:10:55,397][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:10:56,024][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:10:56,683][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:10:57,341][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:10:57,999][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:10:58,657][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:10:59,316][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:10:59,973][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:11:00,630][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:11:01,289][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:11:01,946][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:11:02,604][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:11:03,262][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:11:03,918][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:11:04,576][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:11:05,233][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:11:05,891][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:11:06,549][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:11:07,207][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:11:07,864][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:11:08,522][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:11:09,179][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:11:09,836][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:11:10,494][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:11:11,151][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:11:11,808][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:11:12,465][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:11:13,122][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:11:13,779][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:11:14,437][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:11:15,094][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:11:15,751][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:11:16,409][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:11:17,066][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:11:17,724][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:11:18,381][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:11:19,038][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:11:19,695][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:11:20,352][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:11:21,009][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:11:21,666][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:11:22,323][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:11:22,980][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:11:23,637][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:11:24,295][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:11:24,952][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:11:25,609][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:11:26,267][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:11:27,257][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:11:27,917][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:11:28,575][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:11:29,233][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:11:29,891][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:11:30,548][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:11:31,207][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:11:31,864][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:11:32,521][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:11:33,178][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:11:33,836][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:11:34,494][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:11:35,152][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:11:35,810][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:11:36,469][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:11:37,127][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:11:37,784][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:11:38,560][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:11:39,940][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:11:41,286][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:11:41,287][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:11:42,856][__main__][INFO] - Iteration 135 took 54s (10.57% Gen, 86.56% Train). Generation: 5s, Training: 47s. Estimated remaining time: 13h 6m 14s. Estimated total time: 15h 11m 51s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 11s, 500 more iterations: 7h 35m 55s. [2026-03-25 16:11:42,858][__main__][INFO] - Starting iteration 135. [2026-03-25 16:11:42,863][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:11:42,864][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:11:54,885][__main__][INFO] - Number of regex retries in iteration 135: 0 [2026-03-25 16:11:54,886][__main__][INFO] - agents played in iteration 135 are Bob, Alice [2026-03-25 16:11:55,950][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:11:56,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:11:56,012][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:11:56,013][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:11:56,759][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:11:57,392][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:11:58,051][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:11:58,708][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:11:59,365][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:12:00,022][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:12:00,679][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:12:01,336][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:12:01,995][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:12:02,651][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:12:03,308][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:12:03,966][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:12:04,623][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:12:05,281][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:12:05,939][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:12:06,595][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:12:07,252][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:12:07,909][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:12:08,566][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:12:09,224][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:12:09,881][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:12:10,539][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:12:11,196][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:12:11,853][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:12:12,512][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:12:13,171][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:12:13,829][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:12:14,486][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:12:15,144][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:12:15,800][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:12:16,457][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:12:17,114][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:12:17,772][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:12:18,434][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:12:19,092][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:12:19,750][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:12:20,408][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:12:21,065][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:12:21,722][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:12:22,381][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:12:23,045][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:12:23,701][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:12:24,358][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:12:25,015][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:12:25,672][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:12:26,329][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:12:26,987][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:12:27,644][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:12:28,625][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:12:29,282][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:12:29,937][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:12:30,594][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:12:31,251][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:12:31,908][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:12:32,565][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:12:33,223][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:12:33,880][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:12:34,538][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:12:35,196][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:12:35,855][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:12:36,513][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:12:37,170][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:12:37,828][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:12:38,486][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:12:39,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:12:39,914][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:12:41,405][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:12:41,408][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:12:41,410][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:12:43,809][__main__][INFO] - Iteration 136 took 1m 0s (19.73% Gen, 76.33% Train). Generation: 12s, Training: 46s. Estimated remaining time: 14h 49m 9s. Estimated total time: 16h 55m 48s. Time estimates for 10 more iterations: 10m 9s, 100 more iterations: 1h 41m 34s, 500 more iterations: 8h 27m 54s. [2026-03-25 16:12:43,811][__main__][INFO] - Starting iteration 136. [2026-03-25 16:12:43,817][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:12:43,818][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:12:51,526][__main__][INFO] - Number of regex retries in iteration 136: 0 [2026-03-25 16:12:51,527][__main__][INFO] - agents played in iteration 136 are Bob, Alice [2026-03-25 16:12:52,105][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:12:52,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:12:52,168][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:12:52,169][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:12:52,824][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:12:53,438][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:12:54,095][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:12:54,752][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:12:55,417][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:12:56,071][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:12:56,729][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:12:57,386][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:12:58,043][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:12:58,700][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:12:59,359][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:13:00,016][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:13:00,674][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:13:01,331][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:13:01,988][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:13:02,647][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:13:03,305][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:13:03,962][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:13:04,620][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:13:05,277][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:13:05,935][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:13:06,593][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:13:07,251][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:13:07,908][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:13:08,565][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:13:09,222][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:13:09,880][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:13:10,539][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:13:11,196][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:13:11,853][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:13:12,511][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:13:13,169][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:13:13,827][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:13:14,484][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:13:15,142][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:13:15,799][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:13:16,456][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:13:17,114][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:13:17,771][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:13:18,428][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:13:19,086][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:13:19,743][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:13:20,401][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:13:21,059][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:13:21,717][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:13:22,374][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:13:23,031][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:13:23,689][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:13:24,683][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:13:25,341][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:13:25,999][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:13:26,656][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:13:27,316][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:13:27,974][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:13:28,633][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:13:29,290][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:13:29,948][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:13:30,609][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:13:31,267][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:13:31,926][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:13:32,583][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:13:33,241][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:13:33,900][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:13:34,558][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:13:35,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:13:36,041][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:13:37,343][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:13:37,346][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:13:37,347][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:13:38,795][__main__][INFO] - Iteration 137 took 54s (14.02% Gen, 83.34% Train). Generation: 7s, Training: 45s. Estimated remaining time: 13h 8m 47s. Estimated total time: 15h 16m 21s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 38s, 500 more iterations: 7h 38m 10s. [2026-03-25 16:13:38,798][__main__][INFO] - Starting iteration 137. [2026-03-25 16:13:38,802][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:13:38,802][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:13:43,617][__main__][INFO] - Number of regex retries in iteration 137: 0 [2026-03-25 16:13:43,618][__main__][INFO] - agents played in iteration 137 are Bob, Alice [2026-03-25 16:13:44,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:13:44,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:13:44,146][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:13:44,147][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:13:44,975][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:13:45,591][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:13:46,250][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:13:46,908][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:13:47,566][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:13:48,224][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:13:48,881][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:13:49,539][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:13:50,195][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:13:50,854][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:13:51,511][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:13:52,169][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:13:52,827][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:13:53,485][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:13:54,146][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:13:54,804][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:13:55,463][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:13:56,120][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:13:56,778][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:13:57,436][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:13:58,093][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:13:58,752][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:13:59,411][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:14:00,070][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:14:00,727][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:14:01,384][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:14:02,042][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:14:02,699][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:14:03,356][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:14:04,014][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:14:04,671][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:14:05,327][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:14:05,985][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:14:06,642][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:14:07,300][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:14:07,957][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:14:08,614][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:14:09,273][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:14:09,931][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:14:10,588][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:14:11,246][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:14:11,903][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:14:12,561][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:14:13,219][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:14:13,877][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:14:14,534][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:14:15,193][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:14:15,850][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:14:16,833][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:14:17,491][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:14:18,151][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:14:18,807][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:14:19,465][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:14:20,123][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:14:20,780][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:14:21,437][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:14:22,094][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:14:22,751][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:14:23,408][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:14:24,066][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:14:24,723][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:14:25,380][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:14:26,040][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:14:26,697][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:14:27,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:14:28,182][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:14:29,507][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:14:29,510][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:14:29,511][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:14:30,997][__main__][INFO] - Iteration 138 took 52s (9.23% Gen, 87.92% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 21m 31s. Estimated total time: 14h 29m 57s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 59s, 500 more iterations: 7h 14m 58s. [2026-03-25 16:14:30,999][__main__][INFO] - Starting iteration 138. [2026-03-25 16:14:31,003][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:14:31,004][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:14:35,930][__main__][INFO] - Number of regex retries in iteration 138: 0 [2026-03-25 16:14:35,931][__main__][INFO] - agents played in iteration 138 are Bob, Alice [2026-03-25 16:14:36,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:14:36,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:14:36,455][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:14:36,455][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:14:37,290][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:14:37,899][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:14:38,561][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:14:39,220][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:14:39,877][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:14:40,534][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:14:41,197][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:14:41,853][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:14:42,509][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:14:43,169][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:14:43,827][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:14:44,485][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:14:45,142][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:14:45,801][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:14:46,460][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:14:47,118][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:14:47,775][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:14:48,432][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:14:49,090][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:14:49,748][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:14:50,406][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:14:51,066][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:14:51,724][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:14:52,382][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:14:53,039][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:14:53,696][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:14:54,353][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:14:55,010][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:14:55,668][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:14:56,326][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:14:56,984][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:14:57,643][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:14:58,302][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:14:58,961][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:14:59,619][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:15:00,277][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:15:00,936][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:15:01,594][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:15:02,253][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:15:02,911][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:15:03,568][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:15:04,225][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:15:04,882][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:15:05,540][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:15:06,197][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:15:06,855][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:15:07,513][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:15:08,170][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:15:09,155][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:15:09,815][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:15:10,473][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:15:11,132][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:15:11,788][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:15:12,446][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:15:13,107][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:15:13,765][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:15:14,423][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:15:15,082][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:15:15,740][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:15:16,397][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:15:17,055][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:15:17,713][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:15:18,370][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:15:19,028][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:15:19,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:15:20,418][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:15:21,769][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:15:21,772][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:15:21,773][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:15:23,204][__main__][INFO] - Iteration 139 took 52s (9.44% Gen, 87.82% Train). Generation: 4s, Training: 45s. Estimated remaining time: 12h 20m 44s. Estimated total time: 14h 30m 2s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 0s, 500 more iterations: 7h 15m 1s. [2026-03-25 16:15:23,206][__main__][INFO] - Starting iteration 139. [2026-03-25 16:15:23,210][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:15:23,210][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:15:31,363][__main__][INFO] - Number of regex retries in iteration 139: 0 [2026-03-25 16:15:31,364][__main__][INFO] - agents played in iteration 139 are Bob, Alice [2026-03-25 16:15:31,936][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:15:31,999][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:15:31,999][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:15:32,000][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:15:32,681][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:15:33,299][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:15:33,955][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:15:34,612][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:15:35,271][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:15:35,929][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:15:36,588][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:15:37,246][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:15:37,905][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:15:38,563][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:15:39,221][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:15:39,881][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:15:40,539][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:15:41,196][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:15:41,855][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:15:42,512][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:15:43,170][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:15:43,828][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:15:44,487][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:15:45,145][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:15:45,803][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:15:46,460][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:15:47,118][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:15:47,776][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:15:48,434][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:15:49,091][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:15:49,748][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:15:50,406][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:15:51,065][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:15:51,723][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:15:52,380][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:15:53,037][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:15:53,694][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:15:54,351][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:15:55,008][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:15:55,665][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:15:56,323][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:15:56,980][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:15:57,637][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:15:58,295][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:15:58,953][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:15:59,610][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:16:00,268][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:16:00,925][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:16:01,583][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:16:02,240][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:16:02,898][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:16:03,555][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:16:04,541][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:16:05,199][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:16:05,859][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:16:06,517][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:16:07,175][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:16:07,834][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:16:08,490][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:16:09,148][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:16:09,807][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:16:10,465][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:16:11,123][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:16:11,781][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:16:12,438][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:16:13,096][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:16:13,755][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:16:14,414][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:16:15,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:16:15,848][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:16:17,164][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:16:17,166][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:16:17,167][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:16:19,002][__main__][INFO] - Iteration 140 took 55s (14.61% Gen, 82.09% Train). Generation: 8s, Training: 45s. Estimated remaining time: 13h 19m 40s. Estimated total time: 15h 29m 54s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 59s, 500 more iterations: 7h 44m 57s. [2026-03-25 16:16:19,005][__main__][INFO] - Starting iteration 140. [2026-03-25 16:16:19,009][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:16:19,009][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:16:24,087][__main__][INFO] - Number of regex retries in iteration 140: 0 [2026-03-25 16:16:24,088][__main__][INFO] - agents played in iteration 140 are Bob, Alice [2026-03-25 16:16:24,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:16:24,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:16:24,802][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:16:24,802][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:16:25,472][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:16:26,087][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:16:26,745][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:16:27,402][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:16:28,062][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:16:28,720][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:16:29,381][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:16:30,038][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:16:30,695][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:16:31,354][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:16:32,011][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:16:32,670][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:16:33,325][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:16:33,984][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:16:34,641][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:16:35,299][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:16:35,956][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:16:36,614][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:16:37,269][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:16:37,928][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:16:38,585][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:16:39,242][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:16:39,899][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:16:40,557][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:16:41,214][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:16:41,871][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:16:42,529][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:16:43,187][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:16:43,845][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:16:44,502][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:16:45,159][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:16:45,816][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:16:46,474][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:16:47,132][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:16:47,789][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:16:48,447][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:16:49,105][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:16:49,762][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:16:50,419][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:16:51,076][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:16:51,734][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:16:52,392][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:16:53,050][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:16:53,708][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:16:54,369][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:16:55,028][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:16:55,686][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:16:56,344][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:16:57,333][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:16:57,990][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:16:58,649][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:16:59,307][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:16:59,965][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:17:00,621][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:17:01,279][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:17:01,938][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:17:02,596][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:17:03,253][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:17:03,911][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:17:04,568][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:17:05,224][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:17:05,883][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:17:06,540][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:17:07,197][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:17:07,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:17:08,796][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:17:10,374][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:17:10,377][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:17:10,378][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:17:12,266][__main__][INFO] - Iteration 141 took 53s (9.54% Gen, 86.91% Train). Generation: 5s, Training: 46s. Estimated remaining time: 12h 36m 32s. Estimated total time: 14h 47m 39s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 45s, 500 more iterations: 7h 23m 49s. [2026-03-25 16:17:12,269][__main__][INFO] - Starting iteration 141. [2026-03-25 16:17:12,276][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:17:12,277][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:17:20,707][__main__][INFO] - Number of regex retries in iteration 141: 0 [2026-03-25 16:17:20,709][__main__][INFO] - agents played in iteration 141 are Bob, Alice [2026-03-25 16:17:21,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:17:21,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:17:21,823][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:17:21,823][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:17:22,526][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:17:23,132][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:17:23,789][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:17:24,448][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:17:25,104][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:17:25,760][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:17:26,419][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:17:27,074][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:17:27,731][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:17:28,391][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:17:29,050][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:17:29,707][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:17:30,365][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:17:31,023][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:17:31,680][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:17:32,338][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:17:32,995][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:17:33,653][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:17:34,310][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:17:34,967][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:17:35,625][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:17:36,281][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:17:36,939][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:17:37,597][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:17:38,254][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:17:38,911][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:17:39,571][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:17:40,228][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:17:40,885][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:17:41,542][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:17:42,199][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:17:42,856][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:17:43,515][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:17:44,172][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:17:44,829][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:17:45,487][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:17:46,144][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:17:46,801][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:17:47,458][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:17:48,115][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:17:48,774][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:17:49,432][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:17:50,086][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:17:50,743][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:17:51,400][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:17:52,059][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:17:52,716][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:17:53,373][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:17:54,357][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:17:55,015][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:17:55,672][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:17:56,330][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:17:56,987][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:17:57,644][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:17:58,303][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:17:58,961][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:17:59,618][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:18:00,275][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:18:00,932][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:18:01,589][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:18:02,247][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:18:02,904][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:18:03,561][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:18:04,219][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:18:04,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:18:05,655][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:18:07,090][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:18:07,093][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:18:07,095][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:18:08,624][__main__][INFO] - Iteration 142 took 56s (14.96% Gen, 82.32% Train). Generation: 8s, Training: 46s. Estimated remaining time: 13h 27m 6s. Estimated total time: 15h 39m 9s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 54s, 500 more iterations: 7h 49m 34s. [2026-03-25 16:18:08,627][__main__][INFO] - Starting iteration 142. [2026-03-25 16:18:08,632][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:18:08,633][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:18:18,265][__main__][INFO] - Number of regex retries in iteration 142: 0 [2026-03-25 16:18:18,266][__main__][INFO] - agents played in iteration 142 are Bob, Alice [2026-03-25 16:18:18,739][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:18:18,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:18:18,802][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:18:18,803][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:18:19,454][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:18:20,076][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:18:20,723][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:18:21,381][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:18:22,037][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:18:22,700][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:18:23,357][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:18:24,015][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:18:24,674][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:18:25,334][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:18:25,990][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:18:26,649][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:18:27,307][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:18:27,966][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:18:28,624][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:18:29,280][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:18:29,935][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:18:30,593][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:18:31,250][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:18:31,909][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:18:32,566][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:18:33,224][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:18:33,882][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:18:34,540][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:18:35,198][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:18:35,856][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:18:36,514][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:18:37,172][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:18:37,828][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:18:38,486][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:18:39,143][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:18:39,801][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:18:40,459][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:18:41,117][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:18:41,774][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:18:42,431][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:18:43,089][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:18:43,747][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:18:44,404][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:18:45,062][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:18:45,719][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:18:46,376][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:18:47,034][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:18:47,691][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:18:48,349][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:18:49,006][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:18:49,663][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:18:50,320][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:18:51,311][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:18:51,971][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:18:52,627][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:18:53,285][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:18:53,948][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:18:54,606][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:18:55,263][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:18:55,921][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:18:56,578][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:18:57,235][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:18:57,895][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:18:58,555][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:18:59,221][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:18:59,878][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:19:00,538][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:19:01,199][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:19:01,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:19:02,790][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:19:04,167][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:19:04,170][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:19:04,171][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:19:05,709][__main__][INFO] - Iteration 143 took 57s (16.88% Gen, 80.42% Train). Generation: 9s, Training: 45s. Estimated remaining time: 13h 38m 18s. Estimated total time: 15h 51m 18s. Time estimates for 10 more iterations: 9m 30s, 100 more iterations: 1h 35m 7s, 500 more iterations: 7h 55m 39s. [2026-03-25 16:19:05,711][__main__][INFO] - Starting iteration 143. [2026-03-25 16:19:05,722][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:19:05,722][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:19:11,543][__main__][INFO] - Number of regex retries in iteration 143: 0 [2026-03-25 16:19:11,544][__main__][INFO] - agents played in iteration 143 are Bob, Alice [2026-03-25 16:19:12,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:19:12,068][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:19:12,069][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:19:12,070][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:19:12,741][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:19:13,346][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:19:14,004][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:19:14,660][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:19:15,320][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:19:15,981][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:19:16,640][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:19:17,299][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:19:17,956][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:19:18,615][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:19:19,272][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:19:19,931][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:19:20,588][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:19:21,246][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:19:21,903][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:19:22,561][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:19:23,220][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:19:23,877][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:19:24,535][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:19:25,192][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:19:25,849][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:19:26,506][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:19:27,163][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:19:27,822][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:19:28,480][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:19:29,139][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:19:29,797][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:19:30,453][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:19:31,111][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:19:31,768][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:19:32,425][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:19:33,083][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:19:33,741][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:19:34,398][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:19:35,057][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:19:35,715][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:19:36,373][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:19:37,030][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:19:37,689][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:19:38,348][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:19:39,006][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:19:39,664][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:19:40,321][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:19:40,978][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:19:41,636][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:19:42,293][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:19:42,951][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:19:43,609][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:19:44,602][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:19:45,260][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:19:45,917][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:19:46,574][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:19:47,232][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:19:47,890][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:19:48,547][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:19:49,204][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:19:49,862][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:19:50,520][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:19:51,177][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:19:51,836][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:19:52,493][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:19:53,151][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:19:53,813][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:19:54,471][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:19:55,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:19:55,916][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:19:57,259][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:19:57,263][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:19:57,265][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:19:59,148][__main__][INFO] - Iteration 144 took 53s (10.90% Gen, 85.58% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 36m 33s. Estimated total time: 14h 50m 27s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 2s, 500 more iterations: 7h 25m 13s. [2026-03-25 16:19:59,151][__main__][INFO] - Starting iteration 144. [2026-03-25 16:19:59,155][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:19:59,156][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:20:04,128][__main__][INFO] - Number of regex retries in iteration 144: 0 [2026-03-25 16:20:04,129][__main__][INFO] - agents played in iteration 144 are Bob, Alice [2026-03-25 16:20:04,681][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:20:04,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:20:04,743][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:20:04,744][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:20:05,400][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:20:06,011][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:20:06,669][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:20:07,328][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:20:07,985][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:20:08,647][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:20:09,309][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:20:09,962][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:20:10,620][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:20:11,278][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:20:11,937][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:20:12,593][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:20:13,252][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:20:13,910][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:20:14,567][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:20:15,225][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:20:15,882][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:20:16,540][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:20:17,197][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:20:17,855][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:20:18,513][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:20:19,170][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:20:19,826][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:20:20,485][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:20:21,142][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:20:21,800][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:20:22,458][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:20:23,115][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:20:23,773][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:20:24,431][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:20:25,089][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:20:25,747][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:20:26,404][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:20:27,061][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:20:27,718][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:20:28,751][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:20:30,457][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:20:31,113][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:20:31,770][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:20:32,426][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:20:33,083][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:20:33,741][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:20:34,400][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:20:35,057][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:20:35,715][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:20:36,372][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:20:37,030][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:20:37,687][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:20:38,675][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:20:39,338][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:20:39,995][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:20:40,654][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:20:41,312][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:20:41,970][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:20:42,629][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:20:43,287][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:20:43,944][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:20:44,603][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:20:45,260][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:20:45,917][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:20:46,574][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:20:47,232][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:20:47,890][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:20:48,548][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:20:49,206][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:20:50,038][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 16:20:51,391][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:20:51,394][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:20:51,397][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:20:53,453][__main__][INFO] - Iteration 145 took 54s (9.16% Gen, 87.05% Train). Generation: 4s, Training: 47s. Estimated remaining time: 12h 50m 12s. Estimated total time: 15h 5m 0s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 30s, 500 more iterations: 7h 32m 30s. [2026-03-25 16:20:53,456][__main__][INFO] - Starting iteration 145. [2026-03-25 16:20:53,460][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:20:53,460][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:20:58,864][__main__][INFO] - Number of regex retries in iteration 145: 0 [2026-03-25 16:20:58,865][__main__][INFO] - agents played in iteration 145 are Bob, Alice [2026-03-25 16:20:59,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:20:59,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:20:59,434][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:20:59,435][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:21:00,192][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:21:00,796][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:21:01,456][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:21:02,114][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:21:02,772][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:21:03,430][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:21:04,090][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:21:04,748][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:21:05,407][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:21:06,067][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:21:06,725][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:21:07,386][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:21:08,044][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:21:08,704][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:21:09,363][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:21:10,022][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:21:10,683][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:21:11,342][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:21:12,001][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:21:12,660][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:21:13,317][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:21:13,976][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:21:14,635][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:21:15,293][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:21:15,953][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:21:16,611][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:21:17,271][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:21:17,929][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:21:18,588][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:21:19,246][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:21:19,904][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:21:20,562][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:21:21,220][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:21:21,879][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:21:22,537][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:21:23,195][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:21:23,855][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:21:24,514][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:21:25,173][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:21:25,831][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:21:26,490][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:21:27,149][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:21:27,807][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:21:28,466][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:21:29,125][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:21:29,784][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:21:30,442][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:21:31,100][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:21:32,080][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:21:32,739][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:21:33,399][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:21:34,058][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:21:34,716][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:21:35,374][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:21:36,033][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:21:36,690][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:21:37,349][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:21:38,006][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:21:38,665][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:21:39,323][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:21:39,981][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:21:40,639][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:21:41,297][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:21:41,954][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:21:42,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:21:43,439][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:21:45,235][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:21:45,238][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:21:45,240][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:21:46,754][__main__][INFO] - Iteration 146 took 53s (10.14% Gen, 87.01% Train). Generation: 5s, Training: 46s. Estimated remaining time: 12h 32m 34s. Estimated total time: 14h 48m 16s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 49s, 500 more iterations: 7h 24m 8s. [2026-03-25 16:21:46,757][__main__][INFO] - Starting iteration 146. [2026-03-25 16:21:46,761][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:21:46,762][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:21:57,710][__main__][INFO] - Number of regex retries in iteration 146: 0 [2026-03-25 16:21:57,711][__main__][INFO] - agents played in iteration 146 are Bob, Alice [2026-03-25 16:21:58,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:21:58,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:21:58,779][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:21:58,779][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:21:59,424][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:22:00,065][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:22:00,723][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:22:01,384][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:22:02,041][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:22:02,699][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:22:03,358][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:22:04,014][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:22:04,673][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:22:05,332][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:22:05,990][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:22:06,648][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:22:07,306][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:22:08,087][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:22:08,774][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:22:09,433][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:22:10,091][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:22:10,747][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:22:11,404][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:22:12,061][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:22:12,719][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:22:13,377][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:22:14,034][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:22:14,694][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:22:15,353][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:22:16,010][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:22:16,668][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:22:17,325][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:22:17,983][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:22:18,645][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:22:19,303][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:22:19,959][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:22:20,617][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:22:21,274][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:22:21,932][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:22:22,589][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:22:23,246][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:22:23,904][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:22:24,562][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:22:25,219][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:22:25,877][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:22:26,534][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:22:27,192][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:22:27,850][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:22:28,508][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:22:29,165][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:22:29,822][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:22:30,480][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:22:31,469][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:22:32,128][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:22:32,785][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:22:33,442][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:22:34,100][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:22:34,760][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:22:35,419][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:22:36,077][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:22:36,735][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:22:37,392][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:22:38,050][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:22:38,709][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:22:39,367][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:22:40,025][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:22:40,683][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:22:41,341][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:22:41,999][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:22:42,813][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:22:44,580][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:22:44,583][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:22:44,584][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:22:46,196][__main__][INFO] - Iteration 147 took 59s (18.42% Gen, 78.91% Train). Generation: 10s, Training: 46s. Estimated remaining time: 14h 13m 56s. Estimated total time: 16h 30m 37s. Time estimates for 10 more iterations: 9m 54s, 100 more iterations: 1h 39m 3s, 500 more iterations: 8h 15m 18s. [2026-03-25 16:22:46,204][__main__][INFO] - Starting iteration 147. [2026-03-25 16:22:46,238][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:22:46,239][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:23:01,143][__main__][INFO] - Number of regex retries in iteration 147: 0 [2026-03-25 16:23:01,145][__main__][INFO] - agents played in iteration 147 are Bob, Alice [2026-03-25 16:23:01,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:23:01,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:23:01,772][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:23:01,773][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:23:02,648][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:23:03,258][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:23:03,916][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:23:04,573][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:23:05,231][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:23:05,888][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:23:06,545][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:23:07,203][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:23:07,860][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:23:08,518][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:23:09,175][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:23:09,832][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:23:10,489][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:23:11,146][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:23:11,804][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:23:12,461][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:23:13,118][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:23:13,776][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:23:14,434][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:23:15,092][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:23:15,749][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:23:16,406][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:23:17,063][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:23:17,720][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:23:18,377][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:23:19,035][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:23:19,693][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:23:20,350][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:23:21,008][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:23:21,665][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:23:22,322][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:23:22,979][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:23:23,636][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:23:24,293][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:23:24,950][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:23:25,608][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:23:26,265][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:23:26,923][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:23:27,580][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:23:28,238][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:23:28,896][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:23:29,553][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:23:30,211][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:23:30,868][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:23:31,525][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:23:32,182][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:23:32,839][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:23:33,496][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:23:34,490][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:23:35,149][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:23:35,809][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:23:36,467][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:23:37,125][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:23:37,783][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:23:38,441][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:23:39,099][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:23:39,758][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:23:40,417][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:23:41,074][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:23:41,732][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:23:42,391][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:23:43,050][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:23:43,708][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:23:44,366][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:23:45,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:23:45,803][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:23:47,766][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:23:47,769][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:23:47,770][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:23:49,220][__main__][INFO] - Iteration 148 took 1m 2s (23.67% Gen, 74.03% Train). Generation: 14s, Training: 46s. Estimated remaining time: 15h 12m 1s. Estimated total time: 17h 29m 44s. Time estimates for 10 more iterations: 10m 29s, 100 more iterations: 1h 44m 58s, 500 more iterations: 8h 44m 52s. [2026-03-25 16:23:49,222][__main__][INFO] - Starting iteration 148. [2026-03-25 16:23:49,226][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:23:49,226][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:23:53,973][__main__][INFO] - Number of regex retries in iteration 148: 0 [2026-03-25 16:23:53,974][__main__][INFO] - agents played in iteration 148 are Bob, Alice [2026-03-25 16:23:54,436][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:23:54,497][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:23:54,497][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:23:54,499][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:23:55,272][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:23:55,899][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:23:56,560][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:23:57,217][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:23:57,875][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:23:58,533][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:23:59,191][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:23:59,849][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:24:00,508][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:24:01,165][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:24:01,823][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:24:02,480][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:24:03,138][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:24:03,796][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:24:04,454][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:24:05,111][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:24:05,769][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:24:06,426][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:24:07,083][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:24:07,740][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:24:08,399][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:24:09,056][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:24:09,714][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:24:10,372][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:24:11,029][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:24:11,689][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:24:12,346][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:24:13,004][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:24:13,662][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:24:14,320][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:24:14,978][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:24:15,636][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:24:16,293][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:24:16,952][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:24:17,609][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:24:18,267][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:24:18,924][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:24:19,581][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:24:20,240][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:24:20,898][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:24:21,555][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:24:22,213][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:24:22,871][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:24:23,528][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:24:24,186][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:24:24,844][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:24:25,502][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:24:26,160][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:24:27,143][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:24:27,801][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:24:28,461][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:24:29,119][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:24:29,777][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:24:30,435][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:24:31,092][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:24:31,751][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:24:32,408][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:24:33,065][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:24:33,723][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:24:34,383][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:24:35,038][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:24:35,696][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:24:36,354][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:24:37,011][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:24:37,671][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:24:38,457][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:24:40,298][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:24:40,302][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:24:40,303][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:24:41,846][__main__][INFO] - Iteration 149 took 52s (9.02% Gen, 88.04% Train). Generation: 4s, Training: 46s. Estimated remaining time: 12h 18m 25s. Estimated total time: 14h 37m 1s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 42s, 500 more iterations: 7h 18m 30s. [2026-03-25 16:24:41,848][__main__][INFO] - Starting iteration 149. [2026-03-25 16:24:41,852][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:24:41,853][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:25:01,442][__main__][INFO] - Number of regex retries in iteration 149: 0 [2026-03-25 16:25:01,443][__main__][INFO] - agents played in iteration 149 are Bob, Alice [2026-03-25 16:25:01,982][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:25:02,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:25:02,044][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:25:02,045][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:25:02,713][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:25:03,326][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:25:03,984][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:25:04,641][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:25:05,298][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:25:05,955][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:25:06,612][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:25:07,270][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:25:07,928][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:25:08,585][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:25:09,243][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:25:09,900][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:25:10,557][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:25:11,215][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:25:11,872][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:25:12,529][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:25:13,186][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:25:13,842][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:25:14,501][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:25:15,158][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:25:15,816][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:25:16,472][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:25:17,130][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:25:17,787][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:25:18,444][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:25:19,101][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:25:19,758][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:25:20,414][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:25:21,071][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:25:21,728][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:25:22,386][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:25:23,045][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:25:23,702][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:25:24,360][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:25:25,019][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:25:25,676][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:25:26,334][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:25:26,991][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:25:27,648][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:25:28,307][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:25:28,964][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:25:29,621][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:25:30,278][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:25:30,936][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:25:31,593][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:25:32,250][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:25:32,907][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:25:33,565][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:25:34,549][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:25:35,208][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:25:35,866][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:25:36,524][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:25:37,181][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:25:37,841][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:25:38,498][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:25:39,155][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:25:39,813][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:25:40,470][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:25:41,127][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:25:41,784][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:25:42,441][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:25:43,099][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:25:43,757][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:25:44,419][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:25:45,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:25:45,946][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:25:47,680][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:25:47,682][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:25:47,684][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:25:49,189][__main__][INFO] - Iteration 150 took 1m 7s (29.09% Gen, 68.67% Train). Generation: 19s, Training: 46s. Estimated remaining time: 16h 22m 35s. Estimated total time: 18h 42m 18s. Time estimates for 10 more iterations: 11m 13s, 100 more iterations: 1h 52m 13s, 500 more iterations: 9h 21m 9s. [2026-03-25 16:25:49,192][__main__][INFO] - Starting iteration 150. [2026-03-25 16:25:49,197][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:25:49,198][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:26:10,517][__main__][INFO] - Number of regex retries in iteration 150: 0 [2026-03-25 16:26:10,519][__main__][INFO] - agents played in iteration 150 are Bob, Alice [2026-03-25 16:26:11,543][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:26:11,611][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:26:11,611][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:26:11,612][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:26:12,323][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:26:12,985][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:26:13,643][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:26:14,301][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:26:14,958][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:26:15,614][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:26:16,271][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:26:16,929][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:26:17,585][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:26:18,243][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:26:18,900][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:26:19,558][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:26:20,215][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:26:20,873][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:26:21,531][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:26:22,187][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:26:22,845][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:26:23,501][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:26:24,159][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:26:24,817][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:26:25,474][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:26:26,131][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:26:26,789][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:26:27,446][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:26:28,103][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:26:28,761][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:26:29,418][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:26:30,075][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:26:30,733][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:26:31,391][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:26:32,048][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:26:32,706][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:26:33,365][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:26:34,022][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:26:34,679][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:26:35,336][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:26:35,993][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:26:36,650][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:26:37,307][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:26:37,964][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:26:38,621][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:26:39,278][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:26:39,935][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:26:40,594][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:26:41,252][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:26:41,909][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:26:42,566][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:26:43,223][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:26:44,205][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:26:44,868][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:26:45,525][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:26:46,183][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:26:46,844][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:26:47,503][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:26:48,161][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:26:48,821][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:26:49,479][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:26:50,135][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:26:50,792][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:26:51,450][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:26:52,108][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:26:52,766][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:26:53,424][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:26:54,082][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:26:54,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:26:55,641][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:26:57,046][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:26:57,048][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:26:57,049][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:26:59,984][__main__][INFO] - Iteration 151 took 1m 10s (30.12% Gen, 65.73% Train). Generation: 21s, Training: 46s. Estimated remaining time: 17h 18m 54s. Estimated total time: 19h 39m 49s. Time estimates for 10 more iterations: 11m 47s, 100 more iterations: 1h 57m 58s, 500 more iterations: 9h 49m 54s. [2026-03-25 16:26:59,986][__main__][INFO] - Starting iteration 151. [2026-03-25 16:26:59,990][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:26:59,990][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:27:05,225][__main__][INFO] - Number of regex retries in iteration 151: 0 [2026-03-25 16:27:05,226][__main__][INFO] - agents played in iteration 151 are Bob, Alice [2026-03-25 16:27:05,826][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:27:05,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:27:05,888][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:27:05,889][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:27:06,594][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:27:07,200][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:27:07,858][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:27:08,515][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:27:09,178][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:27:09,836][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:27:10,495][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:27:11,152][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:27:11,809][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:27:12,467][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:27:13,125][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:27:13,782][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:27:14,442][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:27:15,099][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:27:16,241][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:27:16,899][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:27:17,558][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:27:18,216][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:27:18,873][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:27:19,530][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:27:20,187][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:27:20,845][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:27:21,502][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:27:22,160][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:27:22,817][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:27:23,474][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:27:24,131][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:27:24,788][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:27:25,445][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:27:26,102][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:27:27,899][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:27:28,946][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:27:29,603][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:27:30,260][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:27:30,917][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:27:31,574][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:27:32,231][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:27:32,889][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:27:33,546][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:27:34,202][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:27:34,862][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:27:36,485][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:27:37,143][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:27:37,800][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:27:38,457][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:27:39,116][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:27:39,774][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:27:40,431][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:27:41,406][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:27:42,064][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:27:42,722][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:27:43,380][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:27:44,037][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:27:44,694][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:27:45,351][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:27:46,010][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:27:46,667][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:27:47,325][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:27:47,982][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:27:48,643][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:27:49,301][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:27:49,959][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:27:50,618][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:27:51,275][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:27:51,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:27:52,711][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:46 [2026-03-25 16:27:54,036][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:27:54,038][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:27:54,039][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:27:55,515][__main__][INFO] - Iteration 152 took 55s (9.43% Gen, 87.91% Train). Generation: 5s, Training: 48s. Estimated remaining time: 13h 3m 36s. Estimated total time: 15h 25m 26s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 32s, 500 more iterations: 7h 42m 43s. [2026-03-25 16:27:58,005][__main__][INFO] - Starting iteration 152. [2026-03-25 16:27:58,012][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:27:58,012][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:28:09,729][__main__][INFO] - Number of regex retries in iteration 152: 0 [2026-03-25 16:28:09,730][__main__][INFO] - agents played in iteration 152 are Bob, Alice [2026-03-25 16:28:10,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:28:10,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:28:10,295][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:28:10,296][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:28:11,117][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:28:11,746][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:28:12,404][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:28:13,061][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:28:13,718][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:28:14,375][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:28:15,033][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:28:15,689][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:28:16,347][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:28:17,004][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:28:17,662][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:28:18,319][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:28:18,977][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:28:19,634][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:28:20,292][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:28:20,950][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:28:21,608][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:28:22,264][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:28:22,921][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:28:23,579][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:28:24,237][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:28:24,894][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:28:25,551][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:28:26,209][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:28:26,867][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:28:27,524][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:28:28,182][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:28:28,840][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:28:29,498][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:28:30,155][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:28:30,812][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:28:31,469][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:28:32,126][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:28:32,783][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:28:33,440][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:28:34,098][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:28:34,755][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:28:35,411][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:28:36,069][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:28:36,725][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:28:37,383][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:28:38,040][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:28:38,697][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:28:39,355][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:28:40,013][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:28:40,671][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:28:41,329][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:28:41,986][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:28:42,975][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:28:43,635][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:28:44,291][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:28:44,949][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:28:45,610][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:28:46,266][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:28:46,926][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:28:47,584][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:28:48,242][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:28:48,900][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:28:49,559][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:28:50,219][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:28:50,876][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:28:51,534][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:28:52,191][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:28:52,848][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:28:53,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:28:54,257][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:28:59,171][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:28:59,174][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:28:59,176][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:29:00,787][__main__][INFO] - Iteration 153 took 1m 2s (18.66% Gen, 78.76% Train). Generation: 11s, Training: 49s. Estimated remaining time: 15h 3m 22s. Estimated total time: 17h 26m 17s. Time estimates for 10 more iterations: 10m 27s, 100 more iterations: 1h 44m 37s, 500 more iterations: 8h 43m 8s. [2026-03-25 16:29:00,790][__main__][INFO] - Starting iteration 153. [2026-03-25 16:29:00,794][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:29:00,794][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:29:07,694][__main__][INFO] - Number of regex retries in iteration 153: 0 [2026-03-25 16:29:07,695][__main__][INFO] - agents played in iteration 153 are Bob, Alice [2026-03-25 16:29:08,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:29:08,812][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:29:08,813][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:29:08,813][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:29:09,498][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:29:10,109][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:29:10,766][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:29:11,423][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:29:12,081][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:29:12,739][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:29:13,399][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:29:14,057][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:29:14,714][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:29:15,371][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:29:16,029][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:29:16,688][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:29:17,346][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:29:18,003][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:29:18,660][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:29:19,317][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:29:19,975][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:29:20,631][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:29:21,288][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:29:21,945][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:29:22,603][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:29:23,260][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:29:23,917][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:29:24,574][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:29:25,231][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:29:25,888][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:29:26,548][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:29:27,206][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:29:27,864][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:29:28,522][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:29:29,180][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:29:29,838][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:29:30,495][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:29:31,152][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:29:31,809][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:29:32,466][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:29:33,123][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:29:33,780][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:29:34,438][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:29:35,095][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:29:35,753][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:29:36,411][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:29:37,068][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:29:37,725][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:29:38,384][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:29:39,041][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:29:39,699][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:29:40,356][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:29:41,339][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:29:41,997][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:29:42,655][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:29:43,314][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:29:43,972][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:29:44,630][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:29:45,289][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:29:45,946][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:29:46,603][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:29:47,262][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:29:47,923][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:29:48,579][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:29:49,239][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:29:49,898][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:29:50,554][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:29:51,213][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:29:51,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:29:52,799][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:29:54,124][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:29:54,126][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:29:54,128][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:29:55,560][__main__][INFO] - Iteration 154 took 54s (12.60% Gen, 84.78% Train). Generation: 6s, Training: 46s. Estimated remaining time: 12h 48m 57s. Estimated total time: 15h 12m 47s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 16s, 500 more iterations: 7h 36m 23s. [2026-03-25 16:29:55,562][__main__][INFO] - Starting iteration 154. [2026-03-25 16:29:55,569][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:29:55,569][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:30:01,582][__main__][INFO] - Number of regex retries in iteration 154: 0 [2026-03-25 16:30:01,582][__main__][INFO] - agents played in iteration 154 are Bob, Alice [2026-03-25 16:30:02,598][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:30:02,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:30:02,660][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:30:02,660][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:30:03,363][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:30:03,975][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:30:04,633][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:30:05,292][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:30:05,950][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:30:06,607][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:30:07,271][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:30:07,925][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:30:08,583][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:30:09,241][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:30:09,899][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:30:10,558][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:30:11,217][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:30:11,874][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:30:12,532][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:30:13,190][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:30:13,849][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:30:14,505][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:30:15,162][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:30:15,819][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:30:16,475][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:30:17,132][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:30:17,790][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:30:18,448][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:30:19,107][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:30:19,765][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:30:20,422][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:30:21,079][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:30:21,736][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:30:22,393][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:30:23,050][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:30:23,707][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:30:24,364][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:30:25,026][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:30:25,682][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:30:26,340][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:30:26,998][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:30:27,656][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:30:28,315][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:30:28,973][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:30:29,631][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:30:30,288][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:30:30,946][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:30:31,604][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:30:32,262][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:30:32,919][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:30:33,577][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:30:34,236][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:30:35,217][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:30:35,875][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:30:36,534][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:30:37,192][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:30:37,850][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:30:38,508][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:30:39,165][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:30:39,824][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:30:40,483][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:30:41,141][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:30:41,799][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:30:42,457][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:30:43,118][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:30:43,776][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:30:44,433][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:30:45,092][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:30:45,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:30:46,538][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:30:48,376][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:30:48,378][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:30:48,380][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:30:49,967][__main__][INFO] - Iteration 155 took 54s (11.05% Gen, 86.03% Train). Generation: 6s, Training: 46s. Estimated remaining time: 12h 41m 55s. Estimated total time: 15h 6m 40s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 40s, 500 more iterations: 7h 33m 20s. [2026-03-25 16:30:49,969][__main__][INFO] - Starting iteration 155. [2026-03-25 16:30:49,974][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:30:49,975][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:30:55,715][__main__][INFO] - Number of regex retries in iteration 155: 0 [2026-03-25 16:30:55,717][__main__][INFO] - agents played in iteration 155 are Bob, Alice [2026-03-25 16:30:56,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:30:56,351][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:30:56,352][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:30:56,353][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:30:57,087][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:30:57,711][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:30:58,370][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:30:59,026][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:30:59,682][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:31:00,338][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:31:00,995][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:31:01,651][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:31:02,308][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:31:02,965][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:31:03,622][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:31:04,279][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:31:04,936][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:31:05,592][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:31:06,250][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:31:06,907][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:31:07,564][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:31:08,221][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:31:08,878][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:31:09,544][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:31:10,201][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:31:10,860][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:31:11,517][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:31:12,174][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:31:12,831][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:31:13,488][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:31:14,146][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:31:14,804][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:31:15,462][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:31:16,118][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:31:16,775][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:31:17,432][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:31:18,088][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:31:18,745][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:31:19,402][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:31:20,059][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:31:20,716][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:31:21,377][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:31:22,034][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:31:22,692][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:31:23,349][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:31:24,006][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:31:24,663][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:31:25,320][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:31:25,977][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:31:26,634][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:31:27,290][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:31:27,948][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:31:28,930][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:31:29,589][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:31:30,247][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:31:30,904][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:31:31,563][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:31:32,220][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:31:32,878][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:31:33,535][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:31:34,193][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:31:34,850][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:31:35,507][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:31:36,164][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:31:36,821][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:31:37,541][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:31:38,197][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:31:38,854][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:31:39,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:31:40,312][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:31:41,634][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:31:41,637][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:31:41,638][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:31:43,258][__main__][INFO] - Iteration 156 took 53s (10.78% Gen, 86.18% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 22m 28s. Estimated total time: 14h 48m 5s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 48s, 500 more iterations: 7h 24m 2s. [2026-03-25 16:31:43,261][__main__][INFO] - Starting iteration 156. [2026-03-25 16:31:43,268][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:31:43,269][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:31:48,648][__main__][INFO] - Number of regex retries in iteration 156: 0 [2026-03-25 16:31:48,650][__main__][INFO] - agents played in iteration 156 are Bob, Alice [2026-03-25 16:31:49,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:31:49,206][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:31:49,207][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:31:49,207][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:31:49,998][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:31:50,609][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:31:51,267][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:31:51,925][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:31:52,582][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:31:53,241][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:31:53,898][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:31:54,556][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:31:55,212][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:31:55,869][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:31:56,526][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:31:57,183][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:31:57,840][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:31:58,498][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:31:59,155][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:31:59,812][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:32:00,469][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:32:01,126][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:32:01,784][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:32:02,441][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:32:03,100][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:32:03,757][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:32:04,414][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:32:05,070][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:32:05,727][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:32:06,385][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:32:07,042][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:32:07,700][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:32:08,358][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:32:09,015][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:32:09,674][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:32:10,331][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:32:10,989][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:32:11,646][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:32:12,302][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:32:12,960][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:32:13,617][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:32:14,274][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:32:14,931][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:32:15,588][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:32:16,246][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:32:16,903][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:32:17,561][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:32:18,219][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:32:18,877][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:32:19,534][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:32:20,191][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:32:20,850][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:32:21,842][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:32:22,500][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:32:23,159][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:32:23,817][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:32:24,474][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:32:25,133][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:32:25,790][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:32:26,447][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:32:27,105][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:32:27,761][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:32:28,420][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:32:29,078][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:32:29,735][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:32:30,394][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:32:31,052][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:32:31,709][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:32:32,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:32:33,274][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:32:34,603][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:32:34,606][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:32:34,607][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:32:36,062][__main__][INFO] - Iteration 157 took 52s (10.19% Gen, 87.04% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 13m 25s. Estimated total time: 14h 39m 56s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 59s, 500 more iterations: 7h 19m 58s. [2026-03-25 16:32:36,064][__main__][INFO] - Starting iteration 157. [2026-03-25 16:32:36,068][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:32:36,069][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:32:41,849][__main__][INFO] - Number of regex retries in iteration 157: 0 [2026-03-25 16:32:41,850][__main__][INFO] - agents played in iteration 157 are Bob, Alice [2026-03-25 16:32:42,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:32:42,482][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:32:42,482][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:32:42,483][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:32:43,203][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:32:43,808][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:32:44,464][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:32:45,121][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:32:45,778][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:32:46,436][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:32:47,092][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:32:47,750][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:32:48,407][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:32:49,065][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:32:49,723][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:32:50,381][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:32:51,038][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:32:51,694][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:32:52,352][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:32:53,009][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:32:53,666][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:32:54,325][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:32:54,982][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:32:55,639][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:32:56,296][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:32:56,953][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:32:57,611][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:32:58,269][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:32:58,928][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:32:59,585][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:33:00,242][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:33:00,899][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:33:01,556][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:33:02,214][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:33:02,872][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:33:03,529][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:33:04,186][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:33:04,842][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:33:05,499][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:33:06,156][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:33:06,813][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:33:07,470][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:33:08,127][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:33:08,785][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:33:09,442][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:33:10,101][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:33:10,758][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:33:11,416][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:33:12,073][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:33:12,730][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:33:13,389][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:33:14,047][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:33:15,039][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:33:15,697][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:33:16,354][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:33:17,011][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:33:17,668][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:33:18,324][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:33:18,981][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:33:19,640][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:33:20,299][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:33:20,956][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:33:21,615][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:33:22,271][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:33:22,929][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:33:23,586][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:33:24,243][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:33:24,901][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:33:25,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:33:26,328][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:33:27,670][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:33:27,673][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:33:27,674][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:33:29,184][__main__][INFO] - Iteration 158 took 53s (10.88% Gen, 86.27% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 17m 54s. Estimated total time: 14h 45m 18s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 31s, 500 more iterations: 7h 22m 39s. [2026-03-25 16:33:29,187][__main__][INFO] - Starting iteration 158. [2026-03-25 16:33:29,190][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:33:29,191][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:33:39,345][__main__][INFO] - Number of regex retries in iteration 158: 0 [2026-03-25 16:33:39,346][__main__][INFO] - agents played in iteration 158 are Bob, Alice [2026-03-25 16:33:39,837][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:33:39,898][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:33:39,899][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:33:39,899][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:33:40,572][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:33:41,189][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:33:41,847][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:33:42,504][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:33:43,162][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:33:43,819][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:33:44,477][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:33:45,133][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:33:45,791][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:33:46,448][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:33:47,108][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:33:47,766][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:33:48,423][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:33:49,080][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:33:49,737][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:33:50,395][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:33:51,053][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:33:51,710][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:33:52,369][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:33:53,026][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:33:53,683][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:33:54,340][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:33:54,997][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:33:55,654][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:33:56,312][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:33:56,969][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:33:57,626][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:33:58,284][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:33:58,942][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:33:59,599][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:34:00,258][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:34:00,913][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:34:01,571][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:34:02,228][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:34:02,884][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:34:03,542][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:34:04,198][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:34:04,855][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:34:05,513][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:34:06,171][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:34:06,830][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:34:07,488][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:34:08,145][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:34:08,804][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:34:09,463][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:34:10,120][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:34:10,778][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:34:11,436][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:34:12,418][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:34:13,076][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:34:13,734][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:34:14,391][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:34:15,049][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:34:15,707][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:34:16,365][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:34:17,023][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:34:17,680][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:34:18,339][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:34:18,997][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:34:19,654][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:34:20,313][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:34:20,970][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:34:21,627][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:34:22,284][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:34:22,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:34:23,837][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:34:25,745][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:34:25,748][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:34:25,749][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:34:27,258][__main__][INFO] - Iteration 159 took 58s (17.49% Gen, 79.91% Train). Generation: 10s, Training: 46s. Estimated remaining time: 13h 39m 27s. Estimated total time: 16h 7m 49s. Time estimates for 10 more iterations: 9m 40s, 100 more iterations: 1h 36m 46s, 500 more iterations: 8h 3m 54s. [2026-03-25 16:34:27,260][__main__][INFO] - Starting iteration 159. [2026-03-25 16:34:27,264][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:34:27,265][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:34:44,848][__main__][INFO] - Number of regex retries in iteration 159: 0 [2026-03-25 16:34:44,850][__main__][INFO] - agents played in iteration 159 are Bob, Alice [2026-03-25 16:34:45,896][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:34:45,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:34:45,958][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:34:45,958][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:34:46,621][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:34:47,250][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:34:47,908][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:34:48,565][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:34:49,222][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:34:49,878][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:34:50,537][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:34:51,197][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:34:51,851][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:34:52,509][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:34:53,167][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:34:53,824][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:34:54,483][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:34:55,139][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:34:55,798][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:34:56,454][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:34:57,112][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:34:57,770][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:34:58,427][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:34:59,084][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:34:59,742][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:35:00,399][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:35:01,056][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:35:01,713][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:35:02,370][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:35:03,027][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:35:03,684][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:35:04,342][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:35:04,999][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:35:05,658][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:35:06,314][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:35:06,971][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:35:07,628][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:35:08,285][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:35:08,942][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:35:09,599][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:35:10,257][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:35:10,915][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:35:11,572][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:35:12,229][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:35:12,887][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:35:13,544][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:35:14,201][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:35:14,859][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:35:15,516][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:35:16,178][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:35:16,836][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:35:17,493][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:35:18,477][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:35:19,135][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:35:19,793][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:35:20,450][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:35:21,107][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:35:21,764][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:35:22,421][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:35:23,078][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:35:23,736][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:35:24,394][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:35:25,051][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:35:25,709][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:35:26,367][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:35:27,024][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:35:27,681][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:35:28,339][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:35:28,997][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:35:29,769][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:35:31,290][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:35:31,981][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:35:31,982][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:35:33,591][__main__][INFO] - Iteration 160 took 1m 6s (26.51% Gen, 71.06% Train). Generation: 17s, Training: 47s. Estimated remaining time: 15h 56m 0s. Estimated total time: 18h 25m 28s. Time estimates for 10 more iterations: 11m 3s, 100 more iterations: 1h 50m 32s, 500 more iterations: 9h 12m 44s. [2026-03-25 16:35:33,593][__main__][INFO] - Starting iteration 160. [2026-03-25 16:35:33,598][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:35:33,599][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:35:47,745][__main__][INFO] - Number of regex retries in iteration 160: 0 [2026-03-25 16:35:47,747][__main__][INFO] - agents played in iteration 160 are Bob, Alice [2026-03-25 16:35:48,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:35:48,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:35:48,284][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:35:48,284][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:35:49,129][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:35:49,746][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:35:50,404][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:35:51,061][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:35:51,719][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:35:52,378][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:35:53,035][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:35:53,692][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:35:54,350][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:35:55,007][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:35:55,665][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:35:56,321][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:35:56,979][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:35:57,639][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:35:58,296][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:35:58,953][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:35:59,610][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:36:00,267][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:36:00,924][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:36:01,583][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:36:02,241][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:36:02,899][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:36:03,556][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:36:04,213][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:36:04,871][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:36:05,530][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:36:06,187][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:36:06,844][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:36:07,501][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:36:08,158][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:36:08,816][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:36:09,473][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:36:10,130][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:36:10,789][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:36:11,447][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:36:12,104][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:36:12,763][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:36:13,422][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:36:14,080][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:36:14,738][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:36:15,395][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:36:16,053][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:36:16,710][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:36:17,367][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:36:18,025][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:36:18,682][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:36:19,340][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:36:19,997][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:36:20,982][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:36:21,642][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:36:22,299][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:36:22,959][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:36:23,616][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:36:24,274][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:36:24,933][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:36:25,591][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:36:26,249][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:36:26,907][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:36:27,565][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:36:28,222][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:36:28,880][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:36:29,538][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:36:30,195][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:36:30,853][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:36:31,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:36:32,300][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:36:34,000][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:36:34,003][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:36:34,004][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:36:35,487][__main__][INFO] - Iteration 161 took 1m 1s (22.86% Gen, 74.74% Train). Generation: 14s, Training: 46s. Estimated remaining time: 14h 41m 1s. Estimated total time: 17h 11m 31s. Time estimates for 10 more iterations: 10m 18s, 100 more iterations: 1h 43m 9s, 500 more iterations: 8h 35m 45s. [2026-03-25 16:36:35,490][__main__][INFO] - Starting iteration 161. [2026-03-25 16:36:35,493][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:36:35,493][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:36:42,685][__main__][INFO] - Number of regex retries in iteration 161: 0 [2026-03-25 16:36:42,686][__main__][INFO] - agents played in iteration 161 are Bob, Alice [2026-03-25 16:36:43,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:36:43,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:36:43,316][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:36:43,317][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:36:43,978][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:36:44,586][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:36:45,245][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:36:45,902][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:36:46,559][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:36:47,218][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:36:47,875][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:36:48,532][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:36:49,188][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:36:49,845][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:36:50,501][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:36:51,158][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:36:51,814][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:36:52,471][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:36:53,128][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:36:53,785][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:36:54,442][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:36:55,099][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:36:55,757][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:36:56,414][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:36:57,072][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:36:57,730][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:36:58,387][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:36:59,046][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:36:59,704][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:37:00,364][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:37:01,021][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:37:01,679][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:37:02,336][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:37:02,993][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:37:03,651][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:37:04,308][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:37:04,965][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:37:05,623][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:37:06,280][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:37:06,937][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:37:07,594][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:37:08,252][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:37:08,911][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:37:09,569][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:37:10,227][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:37:10,885][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:37:11,542][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:37:12,199][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:37:12,857][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:37:13,514][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:37:14,172][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:37:14,829][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:37:15,816][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:37:16,477][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:37:17,133][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:37:17,792][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:37:18,449][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:37:19,109][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:37:19,768][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:37:20,426][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:37:21,083][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:37:21,742][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:37:22,401][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:37:23,059][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:37:23,718][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:37:24,375][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:37:25,032][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:37:25,690][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:37:26,347][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:37:27,122][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:37:28,520][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:37:28,523][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:37:28,524][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:37:30,024][__main__][INFO] - Iteration 162 took 54s (13.19% Gen, 84.06% Train). Generation: 7s, Training: 45s. Estimated remaining time: 12h 37m 27s. Estimated total time: 15h 8m 52s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 53s, 500 more iterations: 7h 34m 26s. [2026-03-25 16:37:30,026][__main__][INFO] - Starting iteration 162. [2026-03-25 16:37:30,030][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:37:30,031][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:37:36,121][__main__][INFO] - Number of regex retries in iteration 162: 0 [2026-03-25 16:37:36,122][__main__][INFO] - agents played in iteration 162 are Bob, Alice [2026-03-25 16:37:36,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:37:36,649][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:37:36,649][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:37:36,650][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:37:37,574][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:37:38,183][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:37:38,840][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:37:39,499][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:37:40,158][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:37:40,816][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:37:41,477][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:37:42,135][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:37:42,795][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:37:43,456][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:37:44,114][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:37:44,773][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:37:45,432][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:37:46,089][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:37:46,746][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:37:47,406][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:37:48,064][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:37:48,722][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:37:49,380][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:37:50,040][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:37:50,697][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:37:51,354][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:37:52,012][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:37:52,669][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:37:53,327][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:37:53,986][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:37:54,648][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:37:55,310][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:37:55,965][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:37:56,624][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:37:57,284][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:37:57,942][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:37:58,600][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:37:59,257][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:37:59,914][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:38:00,572][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:38:01,229][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:38:01,886][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:38:02,544][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:38:03,201][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:38:03,859][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:38:04,516][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:38:05,175][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:38:05,833][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:38:06,491][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:38:07,149][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:38:07,807][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:38:08,465][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:38:09,476][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:38:10,135][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:38:10,794][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:38:11,451][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:38:12,109][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:38:12,768][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:38:13,428][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:38:14,086][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:38:14,743][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:38:15,403][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:38:16,060][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:38:16,719][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:38:17,376][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:38:18,034][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:38:18,692][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:38:19,351][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:38:20,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:38:20,709][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:38:22,592][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:38:22,595][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:38:22,596][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:38:24,182][__main__][INFO] - Iteration 163 took 54s (11.25% Gen, 85.82% Train). Generation: 6s, Training: 46s. Estimated remaining time: 12h 30m 15s. Estimated total time: 15h 2m 33s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 15s, 500 more iterations: 7h 31m 16s. [2026-03-25 16:38:24,184][__main__][INFO] - Starting iteration 163. [2026-03-25 16:38:24,189][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:38:24,190][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:38:40,026][__main__][INFO] - Number of regex retries in iteration 163: 0 [2026-03-25 16:38:40,028][__main__][INFO] - agents played in iteration 163 are Bob, Alice [2026-03-25 16:38:41,066][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:38:41,126][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:38:41,127][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:38:41,127][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:38:41,887][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:38:42,496][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:38:43,158][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:38:43,816][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:38:44,473][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:38:45,130][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:38:45,787][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:38:46,444][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:38:47,102][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:38:47,760][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:38:48,417][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:38:49,075][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:38:49,732][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:38:50,389][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:38:51,046][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:38:51,703][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:38:52,360][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:38:53,017][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:38:53,674][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:38:54,331][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:38:54,988][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:38:55,645][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:38:56,302][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:38:56,959][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:38:57,617][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:38:58,274][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:38:58,932][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:38:59,589][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:39:00,247][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:39:00,904][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:39:01,561][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:39:02,219][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:39:02,877][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:39:03,534][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:39:04,191][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:39:04,847][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:39:05,505][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:39:06,162][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:39:06,819][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:39:07,477][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:39:08,134][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:39:08,791][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:39:09,449][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:39:10,106][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:39:10,764][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:39:11,422][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:39:12,079][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:39:12,736][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:39:13,769][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:39:14,427][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:39:15,087][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:39:15,744][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:39:16,402][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:39:17,059][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:39:17,718][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:39:18,376][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:39:19,036][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:39:19,694][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:39:20,351][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:39:21,009][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:39:21,667][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:39:22,324][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:39:22,982][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:39:23,640][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:39:24,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:39:25,196][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:39:26,812][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:39:26,815][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:39:26,816][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:39:28,295][__main__][INFO] - Iteration 164 took 1m 4s (24.71% Gen, 72.98% Train). Generation: 15s, Training: 46s. Estimated remaining time: 15h 15m 5s. Estimated total time: 17h 48m 28s. Time estimates for 10 more iterations: 10m 41s, 100 more iterations: 1h 46m 50s, 500 more iterations: 8h 54m 14s. [2026-03-25 16:39:28,297][__main__][INFO] - Starting iteration 164. [2026-03-25 16:39:28,301][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:39:28,302][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:39:44,357][__main__][INFO] - Number of regex retries in iteration 164: 0 [2026-03-25 16:39:44,359][__main__][INFO] - agents played in iteration 164 are Bob, Alice [2026-03-25 16:39:44,930][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:39:44,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:39:44,992][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:39:44,992][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:39:45,856][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:39:46,468][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:39:47,126][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:39:47,783][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:39:48,440][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:39:49,096][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:39:49,752][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:39:50,410][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:39:51,068][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:39:51,725][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:39:52,381][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:39:53,037][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:39:53,694][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:39:54,352][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:39:55,010][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:39:55,666][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:39:56,324][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:39:56,981][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:39:57,637][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:39:58,295][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:39:58,995][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:39:59,651][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:40:00,308][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:40:00,965][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:40:01,623][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:40:02,280][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:40:02,938][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:40:03,595][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:40:04,253][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:40:08,510][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:40:09,166][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:40:09,823][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:40:10,479][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:40:11,136][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:40:11,793][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:40:12,451][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:40:13,108][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:40:13,766][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:40:14,424][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:40:15,082][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:40:15,740][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:40:16,397][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:40:17,054][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:40:17,713][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:40:18,371][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:40:19,028][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:40:19,686][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:40:20,343][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:40:21,342][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:40:22,000][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:40:22,658][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:40:23,315][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:40:23,973][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:40:24,632][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:40:25,290][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:40:25,947][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:40:26,604][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:40:27,262][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:40:27,920][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:40:28,578][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:40:29,242][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:40:29,901][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:40:30,562][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:40:31,221][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:40:31,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:40:32,681][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:46 [2026-03-25 16:40:34,646][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:40:34,648][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:40:34,650][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:40:36,197][__main__][INFO] - Iteration 165 took 1m 7s (23.65% Gen, 74.07% Train). Generation: 16s, Training: 50s. Estimated remaining time: 16h 17m 6s. Estimated total time: 18h 51m 37s. Time estimates for 10 more iterations: 11m 18s, 100 more iterations: 1h 53m 9s, 500 more iterations: 9h 25m 48s. [2026-03-25 16:40:36,200][__main__][INFO] - Starting iteration 165. [2026-03-25 16:40:36,205][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:40:36,205][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:40:41,272][__main__][INFO] - Number of regex retries in iteration 165: 0 [2026-03-25 16:40:41,274][__main__][INFO] - agents played in iteration 165 are Bob, Alice [2026-03-25 16:40:41,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:40:41,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:40:41,795][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:40:41,795][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:40:42,457][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:40:43,063][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:40:43,721][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:40:44,377][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:40:45,037][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:40:45,695][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:40:46,354][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:40:47,012][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:40:47,669][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:40:48,326][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:40:48,984][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:40:49,641][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:40:50,298][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:40:50,956][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:40:51,614][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:40:52,271][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:40:52,929][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:40:53,586][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:40:54,243][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:40:54,901][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:40:55,559][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:40:56,216][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:40:56,873][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:40:57,530][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:40:58,189][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:40:58,846][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:40:59,505][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:41:00,162][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:41:00,819][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:41:01,477][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:41:02,134][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:41:02,792][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:41:03,450][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:41:04,107][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:41:04,765][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:41:05,422][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:41:06,079][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:41:06,738][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:41:07,396][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:41:08,053][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:41:08,710][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:41:09,367][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:41:10,025][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:41:10,682][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:41:11,340][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:41:11,997][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:41:12,654][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:41:13,312][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:41:14,297][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:41:14,955][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:41:15,612][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:41:16,270][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:41:16,928][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:41:17,585][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:41:18,243][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:41:18,900][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:41:19,558][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:41:20,215][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:41:20,872][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:41:21,529][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:41:22,186][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:41:22,843][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:41:23,500][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:41:24,157][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:41:24,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:41:25,569][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:41:27,495][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:41:27,498][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:41:27,499][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:41:29,461][__main__][INFO] - Iteration 166 took 53s (9.52% Gen, 86.80% Train). Generation: 5s, Training: 46s. Estimated remaining time: 12h 12m 13s. Estimated total time: 14h 47m 37s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 45s, 500 more iterations: 7h 23m 48s. [2026-03-25 16:41:29,463][__main__][INFO] - Starting iteration 166. [2026-03-25 16:41:29,467][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:41:29,467][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:41:34,466][__main__][INFO] - Number of regex retries in iteration 166: 0 [2026-03-25 16:41:34,467][__main__][INFO] - agents played in iteration 166 are Bob, Alice [2026-03-25 16:41:35,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:41:35,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:41:35,105][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:41:35,106][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:41:35,973][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:41:36,584][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:41:37,242][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:41:37,899][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:41:38,558][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:41:39,216][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:41:39,874][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:41:40,532][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:41:41,189][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:41:41,848][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:41:42,504][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:41:43,161][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:41:43,819][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:41:44,476][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:41:45,135][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:41:45,794][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:41:46,451][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:41:47,109][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:41:47,766][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:41:48,423][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:41:49,081][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:41:49,739][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:41:50,396][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:41:51,054][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:41:51,713][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:41:52,371][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:41:53,028][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:41:53,686][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:41:54,343][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:41:55,000][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:41:55,657][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:41:56,314][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:41:56,972][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:41:57,630][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:41:58,287][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:41:58,945][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:41:59,602][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:42:00,259][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:42:00,916][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:42:01,573][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:42:02,230][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:42:02,889][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:42:03,547][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:42:04,204][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:42:04,861][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:42:05,518][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:42:06,175][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:42:06,832][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:42:07,819][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:42:08,479][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:42:09,138][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:42:09,797][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:42:10,453][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:42:11,111][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:42:11,770][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:42:12,431][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:42:13,092][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:42:13,747][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:42:14,406][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:42:15,065][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:42:15,725][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:42:16,382][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:42:17,040][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:42:17,698][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:42:18,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:42:19,154][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:42:21,101][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:42:21,104][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:42:21,105][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:42:22,554][__main__][INFO] - Iteration 167 took 53s (9.42% Gen, 87.85% Train). Generation: 4s, Training: 46s. Estimated remaining time: 12h 8m 31s. Estimated total time: 14h 44m 48s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 28s, 500 more iterations: 7h 22m 24s. [2026-03-25 16:42:22,556][__main__][INFO] - Starting iteration 167. [2026-03-25 16:42:22,560][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:42:22,561][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:42:29,347][__main__][INFO] - Number of regex retries in iteration 167: 0 [2026-03-25 16:42:29,349][__main__][INFO] - agents played in iteration 167 are Bob, Alice [2026-03-25 16:42:30,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:42:30,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:42:30,238][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:42:30,239][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:42:30,982][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:42:31,600][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:42:32,258][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:42:32,918][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:42:33,576][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:42:34,234][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:42:34,892][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:42:35,550][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:42:36,207][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:42:36,865][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:42:37,524][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:42:38,181][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:42:38,839][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:42:39,496][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:42:40,154][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:42:40,811][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:42:41,471][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:42:42,130][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:42:42,789][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:42:43,448][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:42:44,111][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:42:44,765][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:42:45,424][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:42:46,083][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:42:46,740][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:42:47,398][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:42:48,056][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:42:48,713][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:42:49,371][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:42:50,028][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:42:50,686][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:42:51,344][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:42:52,006][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:42:52,662][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:42:53,321][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:42:53,982][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:42:54,640][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:42:55,298][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:42:55,955][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:42:56,613][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:42:57,271][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:42:57,928][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:42:58,585][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:42:59,243][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:42:59,901][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:43:00,558][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:43:01,216][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:43:01,876][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:43:02,869][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:43:03,530][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:43:04,188][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:43:04,846][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:43:05,507][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:43:06,165][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:43:06,822][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:43:07,480][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:43:08,138][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:43:08,795][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:43:09,453][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:43:10,110][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:43:10,767][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:43:11,425][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:43:12,083][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:43:12,740][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:43:13,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:43:14,293][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:43:16,227][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:43:16,230][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:43:16,232][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:43:17,775][__main__][INFO] - Iteration 168 took 55s (12.29% Gen, 84.91% Train). Generation: 6s, Training: 46s. Estimated remaining time: 12h 43m 4s. Estimated total time: 15h 20m 16s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 1s, 500 more iterations: 7h 40m 8s. [2026-03-25 16:43:17,777][__main__][INFO] - Starting iteration 168. [2026-03-25 16:43:17,784][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:43:17,785][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:43:27,549][__main__][INFO] - Number of regex retries in iteration 168: 0 [2026-03-25 16:43:27,551][__main__][INFO] - agents played in iteration 168 are Bob, Alice [2026-03-25 16:43:28,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:43:28,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:43:28,711][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:43:28,711][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:43:29,519][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:43:30,150][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:43:30,807][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:43:31,464][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:43:32,121][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:43:32,781][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:43:33,439][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:43:34,096][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:43:34,763][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:43:35,421][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:43:36,080][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:43:36,737][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:43:37,394][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:43:38,053][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:43:38,710][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:43:39,367][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:43:40,026][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:43:40,684][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:43:41,342][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:43:42,000][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:43:42,657][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:43:43,314][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:43:43,972][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:43:44,630][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:43:45,287][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:43:45,945][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:43:46,603][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:43:47,260][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:43:47,918][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:43:48,576][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:43:49,233][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:43:49,892][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:43:50,549][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:43:51,206][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:43:51,864][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:43:52,520][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:43:53,177][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:43:53,834][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:43:54,491][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:43:55,148][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:43:55,805][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:43:56,462][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:43:57,119][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:43:57,776][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:43:58,434][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:43:59,092][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:43:59,749][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:44:00,406][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:44:01,393][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:44:02,051][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:44:02,709][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:44:03,367][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:44:04,025][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:44:04,684][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:44:05,341][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:44:05,998][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:44:06,655][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:44:07,313][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:44:07,970][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:44:08,627][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:44:09,285][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:44:09,942][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:44:10,600][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:44:11,258][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:44:11,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:44:12,807][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:44:14,781][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:44:14,783][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:44:14,785][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:44:16,347][__main__][INFO] - Iteration 169 took 58s (16.68% Gen, 80.65% Train). Generation: 9s, Training: 47s. Estimated remaining time: 13h 37m 53s. Estimated total time: 16h 16m 4s. Time estimates for 10 more iterations: 9m 45s, 100 more iterations: 1h 37m 36s, 500 more iterations: 8h 8m 2s. [2026-03-25 16:44:16,349][__main__][INFO] - Starting iteration 169. [2026-03-25 16:44:16,354][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:44:16,354][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:44:21,155][__main__][INFO] - Number of regex retries in iteration 169: 0 [2026-03-25 16:44:21,156][__main__][INFO] - agents played in iteration 169 are Bob, Alice [2026-03-25 16:44:21,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:44:21,790][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:44:21,790][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:44:21,791][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:44:22,496][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:44:23,103][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:44:23,761][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:44:24,417][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:44:25,073][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:44:25,732][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:44:26,389][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:44:27,045][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:44:27,703][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:44:28,361][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:44:29,019][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:44:29,677][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:44:30,334][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:44:30,992][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:44:31,649][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:44:32,307][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:44:32,964][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:44:33,622][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:44:34,280][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:44:34,937][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:44:35,594][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:44:36,255][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:44:36,914][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:44:37,572][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:44:38,231][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:44:38,889][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:44:39,548][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:44:40,207][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:44:40,864][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:44:41,523][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:44:42,180][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:44:42,838][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:44:43,496][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:44:44,153][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:44:44,810][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:44:45,469][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:44:46,126][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:44:46,784][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:44:47,443][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:44:48,100][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:44:48,758][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:44:49,415][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:44:50,072][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:44:50,731][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:44:51,388][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:44:52,045][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:44:52,704][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:44:53,362][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:44:54,345][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:44:55,003][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:44:55,661][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:44:56,319][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:44:56,977][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:44:57,637][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:44:58,296][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:44:58,955][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:44:59,615][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:45:00,274][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:45:00,932][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:45:01,591][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:45:02,251][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:45:02,910][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:45:03,566][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:45:04,223][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:45:04,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:45:05,774][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:45:07,333][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:45:07,335][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:45:07,337][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:45:08,908][__main__][INFO] - Iteration 170 took 52s (9.14% Gen, 87.87% Train). Generation: 4s, Training: 46s. Estimated remaining time: 11h 56m 52s. Estimated total time: 14h 35m 56s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 35s, 500 more iterations: 7h 17m 58s. [2026-03-25 16:45:08,910][__main__][INFO] - Starting iteration 170. [2026-03-25 16:45:08,914][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:45:08,914][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:45:17,022][__main__][INFO] - Number of regex retries in iteration 170: 0 [2026-03-25 16:45:17,023][__main__][INFO] - agents played in iteration 170 are Bob, Alice [2026-03-25 16:45:17,509][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:45:17,572][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:45:17,573][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:45:17,573][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:45:18,293][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:45:18,897][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:45:19,554][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:45:20,211][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:45:20,868][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:45:21,525][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:45:22,182][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:45:22,840][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:45:23,498][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:45:24,155][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:45:24,812][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:45:25,469][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:45:26,126][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:45:26,783][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:45:27,440][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:45:28,097][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:45:28,759][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:45:29,417][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:45:30,072][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:45:30,728][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:45:31,385][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:45:32,041][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:45:32,698][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:45:33,355][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:45:34,013][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:45:34,670][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:45:35,330][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:45:35,993][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:45:36,651][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:45:37,310][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:45:37,967][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:45:38,625][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:45:39,284][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:45:39,942][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:45:40,600][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:45:41,258][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:45:41,915][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:45:42,574][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:45:43,231][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:45:43,890][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:45:44,547][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:45:45,205][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:45:45,862][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:45:46,519][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:45:47,178][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:45:47,835][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:45:48,492][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:45:49,151][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:45:50,163][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:45:50,823][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:45:51,480][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:45:52,139][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:45:52,797][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:45:53,454][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:45:54,111][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:45:54,770][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:45:55,429][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:45:56,086][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:45:56,743][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:45:57,401][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:45:58,058][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:45:58,717][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:45:59,375][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:46:00,032][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:46:00,689][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:46:01,609][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:46:03,575][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:46:03,578][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:46:03,579][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:46:05,326][__main__][INFO] - Iteration 171 took 56s (14.37% Gen, 82.52% Train). Generation: 8s, Training: 46s. Estimated remaining time: 13h 0m 14s. Estimated total time: 15h 40m 14s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 1s, 500 more iterations: 7h 50m 7s. [2026-03-25 16:46:05,328][__main__][INFO] - Starting iteration 171. [2026-03-25 16:46:05,333][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:46:05,333][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:46:13,029][__main__][INFO] - Number of regex retries in iteration 171: 0 [2026-03-25 16:46:13,030][__main__][INFO] - agents played in iteration 171 are Bob, Alice [2026-03-25 16:46:13,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:46:13,668][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:46:13,669][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:46:13,669][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:46:14,540][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:46:15,156][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:46:15,816][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:46:16,473][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:46:17,131][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:46:17,789][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:46:18,447][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:46:19,105][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:46:19,766][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:46:20,424][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:46:21,082][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:46:21,740][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:46:22,398][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:46:23,056][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:46:23,715][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:46:24,375][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:46:25,032][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:46:25,690][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:46:26,349][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:46:27,009][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:46:27,666][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:46:28,324][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:46:28,982][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:46:29,640][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:46:30,298][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:46:30,957][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:46:31,614][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:46:32,272][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:46:32,929][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:46:33,587][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:46:34,245][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:46:34,902][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:46:35,560][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:46:36,217][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:46:36,874][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:46:37,532][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:46:38,189][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:46:38,846][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:46:39,503][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:46:40,160][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:46:40,818][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:46:41,476][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:46:42,133][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:46:42,792][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:46:43,449][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:46:44,105][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:46:44,762][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:46:45,420][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:46:46,408][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:46:47,068][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:46:47,726][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:46:48,383][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:46:49,042][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:46:49,699][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:46:50,356][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:46:51,013][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:46:51,671][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:46:52,328][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:46:52,985][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:46:53,642][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:46:54,300][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:46:54,957][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:46:55,614][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:46:56,271][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:46:56,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:46:57,675][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:46:59,067][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:46:59,070][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:46:59,072][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:47:00,772][__main__][INFO] - Iteration 172 took 55s (13.88% Gen, 83.04% Train). Generation: 7s, Training: 46s. Estimated remaining time: 12h 43m 6s. Estimated total time: 15h 24m 1s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 24s, 500 more iterations: 7h 42m 0s. [2026-03-25 16:47:00,774][__main__][INFO] - Starting iteration 172. [2026-03-25 16:47:00,778][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:47:00,779][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:47:09,172][__main__][INFO] - Number of regex retries in iteration 172: 0 [2026-03-25 16:47:09,173][__main__][INFO] - agents played in iteration 172 are Bob, Alice [2026-03-25 16:47:10,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:47:10,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:47:10,206][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:47:10,206][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:47:11,068][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:47:11,684][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:47:12,341][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:47:12,998][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:47:13,655][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:47:14,312][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:47:14,970][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:47:15,628][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:47:16,285][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:47:16,943][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:47:17,601][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:47:18,258][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:47:18,916][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:47:19,573][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:47:20,231][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:47:20,888][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:47:21,545][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:47:22,201][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:47:22,862][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:47:23,520][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:47:24,178][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:47:24,836][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:47:25,494][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:47:26,152][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:47:26,810][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:47:27,466][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:47:28,123][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:47:28,781][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:47:29,438][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:47:30,097][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:47:30,755][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:47:31,412][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:47:32,070][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:47:32,728][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:47:33,386][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:47:34,044][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:47:34,701][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:47:35,358][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:47:36,015][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:47:36,672][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:47:37,329][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:47:37,987][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:47:38,644][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:47:39,301][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:47:39,958][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:47:40,616][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:47:41,274][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:47:41,932][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:47:42,917][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:47:43,575][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:47:44,231][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:47:44,890][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:47:45,548][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:47:46,205][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:47:46,862][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:47:47,520][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:47:48,178][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:47:48,837][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:47:49,497][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:47:50,152][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:47:50,809][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:47:51,467][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:47:52,124][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:47:52,782][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:47:53,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:47:54,254][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:47:55,655][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:47:55,658][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:47:55,659][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:47:57,370][__main__][INFO] - Iteration 173 took 56s (14.83% Gen, 82.14% Train). Generation: 8s, Training: 46s. Estimated remaining time: 13h 1m 21s. Estimated total time: 15h 43m 13s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 19s, 500 more iterations: 7h 51m 36s. [2026-03-25 16:47:57,372][__main__][INFO] - Starting iteration 173. [2026-03-25 16:47:57,378][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:47:57,378][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:48:10,940][__main__][INFO] - Number of regex retries in iteration 173: 0 [2026-03-25 16:48:10,941][__main__][INFO] - agents played in iteration 173 are Bob, Alice [2026-03-25 16:48:11,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:48:12,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:48:12,045][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:48:12,046][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:48:12,820][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:48:13,438][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:48:14,094][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:48:14,750][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:48:15,409][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:48:16,067][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:48:16,724][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:48:17,381][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:48:18,038][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:48:18,694][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:48:19,350][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:48:20,007][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:48:20,664][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:48:21,321][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:48:21,978][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:48:22,635][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:48:23,292][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:48:23,949][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:48:24,605][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:48:25,263][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:48:25,920][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:48:26,578][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:48:27,235][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:48:27,892][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:48:28,550][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:48:29,208][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:48:29,865][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:48:30,523][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:48:31,180][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:48:31,840][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:48:32,495][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:48:33,152][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:48:33,810][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:48:34,467][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:48:35,125][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:48:35,782][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:48:36,439][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:48:37,096][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:48:37,753][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:48:38,410][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:48:39,067][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:48:39,724][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:48:40,382][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:48:41,039][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:48:41,696][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:48:42,354][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:48:43,010][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:48:43,668][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:48:44,659][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:48:45,317][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:48:45,974][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:48:46,631][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:48:47,289][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:48:47,945][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:48:48,604][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:48:49,262][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:48:49,919][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:48:50,576][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:48:51,234][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:48:51,892][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:48:52,549][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:48:53,207][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:48:53,866][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:48:54,523][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:48:55,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:48:55,943][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:48:57,313][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:48:57,316][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:48:57,317][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:48:59,096][__main__][INFO] - Iteration 174 took 1m 1s (21.97% Gen, 75.14% Train). Generation: 13s, Training: 46s. Estimated remaining time: 14h 25m 45s. Estimated total time: 17h 8m 39s. Time estimates for 10 more iterations: 10m 17s, 100 more iterations: 1h 42m 51s, 500 more iterations: 8h 34m 19s. [2026-03-25 16:48:59,098][__main__][INFO] - Starting iteration 174. [2026-03-25 16:48:59,102][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:48:59,103][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:49:04,728][__main__][INFO] - Number of regex retries in iteration 174: 0 [2026-03-25 16:49:04,729][__main__][INFO] - agents played in iteration 174 are Bob, Alice [2026-03-25 16:49:05,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:49:05,248][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:49:05,248][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:49:05,249][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:49:06,016][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:49:06,637][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:49:07,295][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:49:07,951][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:49:08,608][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:49:09,265][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:49:09,922][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:49:10,578][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:49:11,234][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:49:11,891][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:49:12,547][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:49:13,204][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:49:13,860][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:49:14,517][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:49:15,179][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:49:15,838][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:49:16,496][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:49:17,153][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:49:17,812][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:49:18,469][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:49:19,130][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:49:19,786][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:49:20,447][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:49:21,106][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:49:21,764][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:49:22,423][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:49:23,080][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:49:23,737][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:49:24,396][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:49:25,055][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:49:25,711][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:49:26,369][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:49:27,028][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:49:27,683][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:49:28,341][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:49:29,002][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:49:29,660][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:49:30,317][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:49:30,975][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:49:31,633][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:49:32,291][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:49:32,948][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:49:33,607][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:49:34,264][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:49:34,921][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:49:35,579][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:49:36,237][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:49:36,895][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:49:37,874][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:49:38,532][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:49:39,189][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:49:39,848][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:49:40,505][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:49:41,163][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:49:41,820][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:49:42,477][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:49:43,134][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:49:43,792][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:49:44,449][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:49:45,107][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:49:45,764][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:49:46,423][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:49:47,080][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:49:47,740][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:49:48,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:49:49,180][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:49:50,553][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:49:50,556][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:49:50,558][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:49:52,901][__main__][INFO] - Iteration 175 took 53s (10.46% Gen, 85.18% Train). Generation: 5s, Training: 45s. Estimated remaining time: 12h 12m 53s. Estimated total time: 14h 56m 41s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 40s, 500 more iterations: 7h 28m 20s. [2026-03-25 16:49:52,904][__main__][INFO] - Starting iteration 175. [2026-03-25 16:49:52,909][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:49:52,909][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:49:58,772][__main__][INFO] - Number of regex retries in iteration 175: 0 [2026-03-25 16:49:58,773][__main__][INFO] - agents played in iteration 175 are Bob, Alice [2026-03-25 16:49:59,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:49:59,388][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:49:59,389][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:49:59,389][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:50:00,086][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:50:00,711][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:50:01,367][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:50:02,024][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:50:02,680][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:50:03,338][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:50:03,997][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:50:04,653][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:50:05,309][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:50:05,966][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:50:06,623][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:50:07,281][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:50:07,938][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:50:08,595][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:50:09,253][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:50:09,910][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:50:10,567][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:50:11,225][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:50:11,882][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:50:12,538][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:50:13,198][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:50:13,856][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:50:14,512][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:50:15,170][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:50:15,828][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:50:16,486][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:50:17,143][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:50:17,800][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:50:18,458][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:50:19,117][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:50:19,774][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:50:20,431][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:50:21,088][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:50:21,745][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:50:22,402][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:50:23,060][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:50:23,717][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:50:24,375][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:50:25,032][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:50:25,689][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:50:26,347][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:50:27,003][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:50:27,660][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:50:28,318][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:50:28,977][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:50:29,635][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:50:30,292][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:50:30,949][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:50:31,932][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:50:32,589][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:50:33,247][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:50:33,904][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:50:34,561][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:50:35,218][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:50:35,875][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:50:36,532][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:50:37,191][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:50:37,848][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:50:38,505][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:50:39,162][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:50:39,820][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:50:40,477][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:50:41,135][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:50:41,792][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:50:42,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:50:43,222][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:50:44,794][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:50:44,797][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:50:44,798][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:50:46,313][__main__][INFO] - Iteration 176 took 53s (10.98% Gen, 86.18% Train). Generation: 5s, Training: 46s. Estimated remaining time: 12h 5m 25s. Estimated total time: 14h 50m 6s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 0s, 500 more iterations: 7h 25m 3s. [2026-03-25 16:50:46,315][__main__][INFO] - Starting iteration 176. [2026-03-25 16:50:46,318][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:50:46,319][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:50:51,306][__main__][INFO] - Number of regex retries in iteration 176: 0 [2026-03-25 16:50:51,307][__main__][INFO] - agents played in iteration 176 are Bob, Alice [2026-03-25 16:50:51,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:50:51,826][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:50:51,827][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:50:51,827][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:50:52,583][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:50:53,214][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:50:53,872][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:50:54,528][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:50:55,184][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:50:55,840][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:50:56,496][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:50:57,155][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:50:57,809][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:50:58,473][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:50:59,133][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:50:59,789][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:51:00,447][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:51:01,106][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:51:01,766][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:51:02,424][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:51:03,082][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:51:03,741][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:51:04,400][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:51:05,060][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:51:05,720][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:51:06,379][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:51:07,040][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:51:07,696][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:51:08,353][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:51:09,012][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:51:09,669][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:51:10,327][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:51:10,985][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:51:11,643][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:51:12,300][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:51:12,957][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:51:13,615][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:51:14,272][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:51:14,929][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:51:15,586][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:51:16,244][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:51:16,903][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:51:17,560][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:51:18,217][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:51:18,875][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:51:19,532][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:51:20,189][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:51:20,847][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:51:21,506][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:51:22,162][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:51:22,820][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:51:23,477][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:51:24,459][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:51:25,117][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:51:25,775][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:51:26,432][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:51:27,091][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:51:27,747][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:51:28,405][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:51:29,063][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:51:29,720][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:51:30,377][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:51:31,035][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:51:31,692][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:51:32,350][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:51:33,007][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:51:33,664][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:51:34,321][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:51:34,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:51:35,752][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:51:37,315][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:51:37,318][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:51:37,319][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:51:38,785][__main__][INFO] - Iteration 177 took 52s (9.51% Gen, 87.69% Train). Generation: 4s, Training: 46s. Estimated remaining time: 11h 48m 55s. Estimated total time: 14h 34m 28s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 26s, 500 more iterations: 7h 17m 14s. [2026-03-25 16:51:38,788][__main__][INFO] - Starting iteration 177. [2026-03-25 16:51:38,792][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:51:38,792][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:51:44,962][__main__][INFO] - Number of regex retries in iteration 177: 0 [2026-03-25 16:51:44,964][__main__][INFO] - agents played in iteration 177 are Bob, Alice [2026-03-25 16:51:45,419][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:51:45,481][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:51:45,481][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:51:45,482][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:51:46,222][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:51:46,839][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:51:47,498][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:51:48,154][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:51:48,813][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:51:49,472][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:51:50,129][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:51:50,786][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:51:51,444][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:51:52,102][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:51:52,759][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:51:53,416][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:51:54,075][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:51:54,729][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:51:55,386][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:51:56,043][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:51:56,701][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:51:57,359][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:51:58,017][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:51:58,675][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:51:59,332][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:51:59,992][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:52:00,652][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:52:01,311][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:52:01,969][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:52:02,628][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:52:03,286][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:52:03,944][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:52:04,603][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:52:05,262][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:52:05,922][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:52:06,580][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:52:07,239][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:52:07,898][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:52:08,557][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:52:09,217][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:52:09,877][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:52:10,539][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:52:11,205][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:52:11,866][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:52:12,524][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:52:13,181][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:52:13,838][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:52:14,497][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:52:15,152][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:52:15,809][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:52:16,468][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:52:17,124][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:52:18,117][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:52:18,773][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:52:19,430][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:52:20,086][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:52:20,744][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:52:21,401][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:52:22,058][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:52:22,715][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:52:23,376][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:52:24,033][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:52:24,696][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:52:25,355][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:52:26,011][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:52:26,668][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:52:27,325][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:52:27,984][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:52:28,639][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:52:29,429][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:52:30,790][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:52:30,793][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:52:30,794][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:52:32,381][__main__][INFO] - Iteration 178 took 53s (11.52% Gen, 85.52% Train). Generation: 6s, Training: 45s. Estimated remaining time: 12h 6m 43s. Estimated total time: 14h 53m 10s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 19s, 500 more iterations: 7h 26m 35s. [2026-03-25 16:52:32,383][__main__][INFO] - Starting iteration 178. [2026-03-25 16:52:32,388][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:52:32,389][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:52:40,138][__main__][INFO] - Number of regex retries in iteration 178: 0 [2026-03-25 16:52:40,139][__main__][INFO] - agents played in iteration 178 are Bob, Alice [2026-03-25 16:52:41,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:52:41,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:52:41,244][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:52:41,245][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:52:42,028][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:52:42,640][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:52:43,299][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:52:43,958][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:52:44,615][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:52:45,273][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:52:45,936][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:52:46,594][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:52:47,252][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:52:47,910][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:52:48,568][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:52:49,231][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:52:49,888][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:52:50,545][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:52:51,204][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:52:51,862][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:52:52,520][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:52:53,180][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:52:53,838][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:52:54,496][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:52:55,155][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:52:55,815][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:52:56,474][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:52:57,133][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:52:57,793][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:52:58,452][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:52:59,110][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:52:59,769][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:53:00,427][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:53:01,085][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:53:01,744][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:53:02,402][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:53:03,060][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:53:03,717][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:53:04,376][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:53:05,034][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:53:05,692][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:53:06,350][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:53:07,007][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:53:07,666][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:53:08,323][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:53:08,982][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:53:09,640][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:53:10,300][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:53:10,957][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:53:11,616][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:53:12,274][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:53:12,933][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:53:13,957][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:53:14,612][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:53:15,271][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:53:15,929][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:53:16,587][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:53:17,245][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:53:17,902][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:53:18,559][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:53:19,216][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:53:19,873][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:53:20,531][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:53:21,188][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:53:21,846][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:53:22,503][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:53:23,160][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:53:23,818][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:53:24,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:53:25,251][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:53:26,604][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:53:26,606][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:53:26,608][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:53:28,172][__main__][INFO] - Iteration 179 took 55s (13.89% Gen, 83.30% Train). Generation: 7s, Training: 46s. Estimated remaining time: 12h 42m 23s. Estimated total time: 15h 29m 46s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 58s, 500 more iterations: 7h 44m 53s. [2026-03-25 16:53:28,174][__main__][INFO] - Starting iteration 179. [2026-03-25 16:53:28,178][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:53:28,178][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:53:37,605][__main__][INFO] - Number of regex retries in iteration 179: 0 [2026-03-25 16:53:37,606][__main__][INFO] - agents played in iteration 179 are Bob, Alice [2026-03-25 16:53:38,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:53:38,135][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:53:38,136][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:53:38,137][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:53:38,969][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:53:39,588][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:53:40,245][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:53:40,901][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:53:41,558][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:53:42,214][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:53:42,870][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:53:43,526][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:53:44,183][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:53:44,842][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:53:45,499][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:53:46,156][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:53:46,812][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:53:47,469][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:53:48,127][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:53:48,786][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:53:49,442][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:53:50,099][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:53:50,756][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:53:51,413][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:53:52,069][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:53:52,726][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:53:53,383][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:53:54,040][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:53:54,698][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:53:55,355][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:53:56,013][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:53:56,671][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:53:57,328][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:53:57,985][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:53:58,644][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:53:59,299][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:53:59,958][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:54:00,615][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:54:01,270][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:54:01,927][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:54:02,584][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:54:03,241][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:54:03,897][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:54:04,554][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:54:05,211][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:54:05,868][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:54:06,527][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:54:07,182][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:54:07,840][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:54:08,498][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:54:09,156][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:54:09,813][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:54:10,828][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:54:11,486][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:54:12,144][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:54:12,801][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:54:13,460][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:54:14,117][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:54:14,774][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:54:15,431][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:54:16,088][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:54:16,745][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:54:17,404][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:54:18,063][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:54:18,718][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:54:19,376][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:54:20,033][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:54:20,691][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:54:21,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:54:22,157][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:54:23,738][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:54:23,741][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:54:23,743][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:54:25,468][__main__][INFO] - Iteration 180 took 57s (16.46% Gen, 80.53% Train). Generation: 9s, Training: 46s. Estimated remaining time: 13h 6m 32s. Estimated total time: 15h 54m 52s. Time estimates for 10 more iterations: 9m 32s, 100 more iterations: 1h 35m 29s, 500 more iterations: 7h 57m 26s. [2026-03-25 16:54:25,472][__main__][INFO] - Starting iteration 180. [2026-03-25 16:54:25,477][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:54:25,478][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:54:30,262][__main__][INFO] - Number of regex retries in iteration 180: 0 [2026-03-25 16:54:30,263][__main__][INFO] - agents played in iteration 180 are Bob, Alice [2026-03-25 16:54:30,718][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:54:30,781][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:54:30,781][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:54:30,782][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:54:31,549][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:54:32,163][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:54:32,820][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:54:33,477][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:54:34,134][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:54:34,791][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:54:35,448][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:54:36,104][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:54:36,761][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:54:37,419][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:54:38,078][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:54:38,736][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:54:39,397][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:54:40,055][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:54:40,714][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:54:41,371][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:54:42,028][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:54:42,686][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:54:43,343][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:54:44,002][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:54:44,657][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:54:45,316][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:54:45,973][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:54:46,631][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:54:47,288][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:54:47,945][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:54:48,603][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:54:49,260][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:54:49,917][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:54:50,574][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:54:51,231][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:54:51,888][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:54:52,545][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:54:53,202][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:54:53,860][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:54:54,519][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:54:55,176][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:54:55,834][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:54:56,491][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:54:57,149][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:54:57,806][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:54:58,463][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:54:59,120][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:54:59,778][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:55:00,435][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:55:01,093][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:55:01,751][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:55:02,408][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:55:03,386][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:55:04,045][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:55:04,703][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:55:05,360][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:55:06,018][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:55:06,675][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:55:07,334][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:55:07,992][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:55:08,648][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:55:09,307][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:55:09,964][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:55:10,623][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:55:11,281][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:55:11,938][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:55:12,595][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:55:13,253][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:55:13,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:55:14,752][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:55:16,124][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:55:16,126][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:55:16,127][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:55:17,652][__main__][INFO] - Iteration 181 took 52s (9.17% Gen, 87.90% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 40m 25s. Estimated total time: 14h 29m 37s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 57s, 500 more iterations: 7h 14m 48s. [2026-03-25 16:55:17,654][__main__][INFO] - Starting iteration 181. [2026-03-25 16:55:17,658][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:55:17,659][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:55:25,436][__main__][INFO] - Number of regex retries in iteration 181: 0 [2026-03-25 16:55:25,437][__main__][INFO] - agents played in iteration 181 are Bob, Alice [2026-03-25 16:55:26,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:55:26,086][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:55:26,086][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:55:26,087][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:55:26,807][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:55:27,422][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:55:28,081][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:55:28,739][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:55:29,395][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:55:30,052][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:55:30,710][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:55:31,367][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:55:32,023][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:55:32,680][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:55:33,337][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:55:33,994][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:55:34,652][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:55:35,310][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:55:35,969][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:55:36,624][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:55:37,284][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:55:37,942][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:55:38,599][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:55:39,257][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:55:39,915][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:55:40,572][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:55:41,229][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:55:41,887][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:55:42,545][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:55:43,202][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:55:43,861][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:55:44,517][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:55:45,175][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:55:45,833][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:55:46,491][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:55:47,149][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:55:47,809][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:55:48,464][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:55:49,123][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:55:49,778][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:55:50,435][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:55:51,091][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:55:51,749][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:55:52,405][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:55:55,101][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:55:56,772][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:55:57,429][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:55:58,088][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:55:58,746][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:55:59,403][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:56:00,060][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:56:00,718][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:56:01,727][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:56:02,382][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:56:03,039][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:56:04,655][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:56:05,573][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:56:06,671][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:56:07,636][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:56:08,293][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:56:08,951][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:56:09,608][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:56:10,265][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:56:10,923][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:56:11,582][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:56:12,240][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:56:12,897][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:56:13,555][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:56:14,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:56:14,999][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:48 [2026-03-25 16:56:16,399][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:56:16,402][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:56:16,403][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:56:18,319][__main__][INFO] - Iteration 182 took 1m 0s (12.82% Gen, 84.01% Train). Generation: 7s, Training: 50s. Estimated remaining time: 14h 0m 50s. Estimated total time: 16h 51m 3s. Time estimates for 10 more iterations: 10m 6s, 100 more iterations: 1h 41m 6s, 500 more iterations: 8h 25m 31s. [2026-03-25 16:56:18,322][__main__][INFO] - Starting iteration 182. [2026-03-25 16:56:18,326][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:56:18,326][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:56:25,683][__main__][INFO] - Number of regex retries in iteration 182: 0 [2026-03-25 16:56:25,684][__main__][INFO] - agents played in iteration 182 are Bob, Alice [2026-03-25 16:56:26,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:56:26,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:56:26,628][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:56:26,628][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:56:27,424][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:56:28,032][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:56:28,691][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:56:29,347][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:56:30,004][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:56:30,665][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:56:31,322][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:56:31,983][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:56:32,638][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:56:33,295][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:56:33,954][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:56:34,611][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:56:35,268][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:56:35,925][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:56:36,584][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:56:37,239][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:56:37,898][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:56:38,557][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:56:39,212][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:56:39,872][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:56:40,530][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:56:41,186][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:56:41,844][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:56:42,500][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:56:43,158][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:56:43,815][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:56:44,473][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:56:45,130][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:56:45,786][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:56:46,445][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:56:47,102][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:56:47,760][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:56:48,417][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:56:49,075][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:56:49,732][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:56:50,392][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:56:51,046][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:56:51,703][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:56:52,360][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:56:53,018][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:56:53,676][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:56:54,333][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:56:54,990][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:56:55,647][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:56:56,304][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:56:56,961][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:56:57,618][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:56:58,276][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:56:59,260][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:56:59,918][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:57:00,575][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:57:01,232][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:57:01,889][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:57:02,546][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:57:03,204][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:57:03,861][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:57:04,519][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:57:05,176][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:57:05,832][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:57:06,492][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:57:07,149][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:57:07,807][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:57:08,465][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:57:09,123][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:57:09,781][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:57:10,615][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:57:11,998][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:57:12,001][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:57:12,002][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:57:13,595][__main__][INFO] - Iteration 183 took 55s (13.31% Gen, 83.80% Train). Generation: 7s, Training: 46s. Estimated remaining time: 12h 30m 3s. Estimated total time: 15h 21m 11s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 7s, 500 more iterations: 7h 40m 35s. [2026-03-25 16:57:13,598][__main__][INFO] - Starting iteration 183. [2026-03-25 16:57:13,603][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:57:13,603][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:57:24,508][__main__][INFO] - Number of regex retries in iteration 183: 0 [2026-03-25 16:57:24,509][__main__][INFO] - agents played in iteration 183 are Bob, Alice [2026-03-25 16:57:25,569][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:57:25,631][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:57:25,632][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:57:25,632][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:57:26,304][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:57:26,926][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:57:27,584][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:57:28,241][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:57:28,899][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:57:29,557][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:57:30,214][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:57:30,871][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:57:31,530][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:57:32,187][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:57:32,844][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:57:33,501][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:57:34,158][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:57:34,815][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:57:35,472][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:57:36,129][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:57:36,786][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:57:37,445][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:57:38,103][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:57:38,759][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:57:39,417][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:57:40,074][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:57:40,732][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:57:41,389][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:57:42,047][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:57:42,704][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:57:43,363][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:57:44,020][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:57:44,678][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:57:45,335][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:57:45,992][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:57:46,649][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:57:47,306][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:57:47,964][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:57:48,622][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:57:49,280][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:57:49,937][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:57:50,594][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:57:51,252][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:57:51,909][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:57:52,566][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:57:53,224][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:57:53,880][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:57:54,538][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:57:55,196][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:57:55,853][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:57:56,510][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:57:57,168][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:57:58,148][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:57:58,806][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:57:59,464][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:58:00,121][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:58:00,778][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:58:01,437][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:58:02,093][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:58:02,752][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:58:03,409][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:58:04,067][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:58:04,724][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:58:05,381][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:58:06,039][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:58:06,696][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:58:07,353][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:58:08,010][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:58:08,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:58:09,552][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 16:58:11,358][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:58:11,362][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:58:11,363][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:58:12,826][__main__][INFO] - Iteration 184 took 59s (18.41% Gen, 79.11% Train). Generation: 10s, Training: 46s. Estimated remaining time: 13h 34m 57s. Estimated total time: 16h 27m 5s. Time estimates for 10 more iterations: 9m 52s, 100 more iterations: 1h 38m 42s, 500 more iterations: 8h 13m 32s. [2026-03-25 16:58:12,829][__main__][INFO] - Starting iteration 184. [2026-03-25 16:58:12,833][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:58:12,833][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:58:21,878][__main__][INFO] - Number of regex retries in iteration 184: 0 [2026-03-25 16:58:21,880][__main__][INFO] - agents played in iteration 184 are Bob, Alice [2026-03-25 16:58:22,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:58:22,426][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:58:22,427][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:58:22,428][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:58:23,120][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:58:23,734][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:58:28,121][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:58:28,778][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:58:29,434][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:58:30,089][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:58:30,921][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:58:31,578][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:58:32,236][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:58:32,892][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:58:33,549][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:58:34,206][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:58:34,863][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:58:35,520][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:58:39,817][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:58:40,471][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:58:41,127][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:58:41,783][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:58:42,440][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:58:43,098][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:58:43,756][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:58:44,414][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:58:45,071][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:58:45,729][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:58:46,386][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:58:47,044][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:58:47,702][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:58:48,362][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:58:49,018][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:58:49,677][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:58:50,334][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:58:50,992][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:58:51,649][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:58:52,305][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:58:52,963][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:58:53,623][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:58:54,280][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:58:54,938][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:58:55,595][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:58:56,250][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:58:56,908][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:58:57,566][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:58:58,223][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:58:58,881][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:58:59,539][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:59:00,195][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:59:00,854][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:59:01,513][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:59:02,516][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:59:03,174][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:59:03,832][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:59:04,490][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:59:05,148][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:59:05,807][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:59:06,465][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:59:07,125][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:59:07,783][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:59:08,442][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:59:09,100][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:59:09,759][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:59:10,422][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:59:11,081][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:59:11,742][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:59:12,400][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:59:13,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:59:13,836][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:50 [2026-03-25 16:59:15,322][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:59:15,325][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:59:15,326][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:59:16,860][__main__][INFO] - Iteration 185 took 1m 4s (14.13% Gen, 83.47% Train). Generation: 9s, Training: 53s. Estimated remaining time: 14h 53m 57s. Estimated total time: 17h 47m 8s. Time estimates for 10 more iterations: 10m 40s, 100 more iterations: 1h 46m 42s, 500 more iterations: 8h 53m 34s. [2026-03-25 16:59:16,864][__main__][INFO] - Starting iteration 185. [2026-03-25 16:59:16,871][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:59:16,872][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:59:23,972][__main__][INFO] - Number of regex retries in iteration 185: 0 [2026-03-25 16:59:23,973][__main__][INFO] - agents played in iteration 185 are Bob, Alice [2026-03-25 16:59:24,544][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:59:24,605][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 16:59:24,606][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:59:24,606][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:59:25,354][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:59:25,961][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:59:26,619][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:59:27,276][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:59:27,933][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:59:28,590][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:59:29,248][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:59:29,905][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:59:30,563][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:59:31,220][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:59:31,879][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:59:32,536][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:59:33,193][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:59:33,852][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:59:34,509][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:59:35,166][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:59:35,824][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:59:36,482][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:59:37,139][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:59:37,798][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:59:38,456][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:59:39,113][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:59:39,770][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:59:40,428][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:59:41,085][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:59:41,742][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:59:42,399][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:59:43,056][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:59:43,713][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:59:44,370][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:59:45,027][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:59:45,685][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:59:46,342][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:59:47,000][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:59:47,660][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:59:48,317][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:59:48,971][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:59:49,630][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:59:50,288][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:59:50,945][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:59:51,603][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:59:52,261][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:59:52,918][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:59:53,576][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:59:54,232][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:59:54,889][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:59:55,549][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:59:56,207][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:59:57,197][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:59:57,859][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:59:58,517][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:59:59,174][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:59:59,832][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:00:00,490][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:00:01,147][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:00:01,805][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:00:02,462][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:00:03,119][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:00:03,778][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:00:04,436][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:00:05,093][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:00:05,751][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:00:06,408][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:00:07,065][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:00:07,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:00:08,590][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:00:09,919][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:00:09,922][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:00:09,923][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:00:11,501][__main__][INFO] - Iteration 186 took 54s (13.00% Gen, 84.11% Train). Generation: 7s, Training: 45s. Estimated remaining time: 12h 16m 25s. Estimated total time: 15h 10m 31s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 3s, 500 more iterations: 7h 35m 15s. [2026-03-25 17:00:11,503][__main__][INFO] - Starting iteration 186. [2026-03-25 17:00:11,508][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:00:11,509][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:00:17,860][__main__][INFO] - Number of regex retries in iteration 186: 0 [2026-03-25 17:00:17,862][__main__][INFO] - agents played in iteration 186 are Bob, Alice [2026-03-25 17:00:18,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:00:18,900][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:00:18,900][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:00:18,901][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:00:19,731][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:00:20,339][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:00:20,997][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:00:21,656][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:00:22,311][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:00:22,970][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:00:23,629][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:00:24,286][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:00:24,945][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:00:25,602][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:00:26,260][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:00:26,919][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:00:27,576][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:00:28,234][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:00:28,893][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:00:29,551][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:00:30,207][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:00:30,865][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:00:31,523][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:00:32,180][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:00:32,837][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:00:33,494][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:00:34,152][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:00:34,809][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:00:35,466][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:00:36,123][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:00:36,781][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:00:37,439][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:00:38,097][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:00:38,755][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:00:39,413][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:00:40,072][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:00:40,729][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:00:41,386][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:00:42,043][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:00:42,700][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:00:43,359][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:00:44,017][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:00:44,675][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:00:45,333][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:00:45,990][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:00:46,647][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:00:47,305][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:00:47,962][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:00:48,619][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:00:49,278][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:00:49,935][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:00:50,599][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:00:51,585][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:00:52,245][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:00:52,904][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:00:53,564][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:00:54,222][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:00:54,879][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:00:55,537][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:00:56,194][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:00:56,852][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:00:57,509][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:00:58,167][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:00:58,825][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:00:59,482][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:01:00,140][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:01:00,796][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:01:01,454][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:01:02,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:01:02,915][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:01:04,944][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:01:04,947][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:01:04,949][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:01:06,393][__main__][INFO] - Iteration 187 took 54s (11.57% Gen, 85.79% Train). Generation: 6s, Training: 47s. Estimated remaining time: 12h 19m 46s. Estimated total time: 15h 14m 47s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 28s, 500 more iterations: 7h 37m 23s. [2026-03-25 17:01:06,396][__main__][INFO] - Starting iteration 187. [2026-03-25 17:01:06,402][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:01:06,402][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:01:12,281][__main__][INFO] - Number of regex retries in iteration 187: 0 [2026-03-25 17:01:12,282][__main__][INFO] - agents played in iteration 187 are Bob, Alice [2026-03-25 17:01:13,228][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:01:13,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:01:13,290][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:01:13,291][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:01:13,954][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:01:14,573][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:01:15,232][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:01:15,888][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:01:16,544][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:01:17,199][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:01:17,856][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:01:18,515][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:01:19,171][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:01:19,828][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:01:20,486][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:01:21,143][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:01:21,801][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:01:22,458][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:01:23,115][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:01:23,773][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:01:24,431][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:01:25,088][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:01:25,747][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:01:26,404][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:01:27,061][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:01:27,717][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:01:28,376][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:01:29,033][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:01:29,690][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:01:30,347][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:01:31,005][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:01:31,662][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:01:32,319][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:01:32,977][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:01:33,634][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:01:34,291][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:01:34,948][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:01:35,606][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:01:36,265][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:01:36,919][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:01:37,578][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:01:38,237][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:01:38,892][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:01:39,550][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:01:40,209][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:01:40,864][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:01:41,523][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:01:42,180][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:01:42,837][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:01:43,495][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:01:44,153][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:01:44,810][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:01:45,800][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:01:46,456][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:01:47,114][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:01:47,771][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:01:48,429][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:01:49,088][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:01:49,746][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:01:50,403][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:01:51,061][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:01:51,719][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:01:52,377][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:01:53,034][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:01:53,693][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:01:54,350][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:01:55,008][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:01:55,665][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:01:56,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:01:57,096][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:01:58,477][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:01:58,480][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:01:58,482][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:01:59,953][__main__][INFO] - Iteration 188 took 53s (10.98% Gen, 86.27% Train). Generation: 5s, Training: 46s. Estimated remaining time: 11h 56m 38s. Estimated total time: 14h 52m 33s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 15s, 500 more iterations: 7h 26m 16s. [2026-03-25 17:01:59,955][__main__][INFO] - Starting iteration 188. [2026-03-25 17:01:59,960][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:01:59,960][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:02:06,267][__main__][INFO] - Number of regex retries in iteration 188: 0 [2026-03-25 17:02:06,269][__main__][INFO] - agents played in iteration 188 are Bob, Alice [2026-03-25 17:02:06,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:02:06,907][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:02:06,907][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:02:06,908][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:02:07,605][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:02:08,226][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:02:08,882][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:02:09,540][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:02:10,197][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:02:10,855][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:02:11,513][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:02:12,172][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:02:12,830][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:02:13,488][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:02:14,146][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:02:14,803][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:02:15,461][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:02:16,119][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:02:16,777][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:02:17,435][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:02:18,092][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:02:18,749][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:02:19,407][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:02:20,064][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:02:20,722][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:02:21,381][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:02:22,038][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:02:22,699][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:02:23,355][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:02:24,012][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:02:24,670][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:02:25,328][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:02:25,987][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:02:26,644][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:02:27,301][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:02:27,959][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:02:28,617][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:02:29,275][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:02:29,932][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:02:30,590][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:02:31,247][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:02:31,905][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:02:32,562][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:02:33,220][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:02:33,878][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:02:34,536][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:02:35,193][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:02:35,850][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:02:36,507][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:02:37,164][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:02:37,821][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:02:38,479][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:02:39,465][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:02:40,124][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:02:40,780][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:02:41,439][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:02:42,098][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:02:42,753][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:02:43,412][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:02:44,069][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:02:44,727][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:02:45,385][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:02:46,043][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:02:46,700][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:02:47,357][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:02:48,016][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:02:48,671][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:02:49,329][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:02:49,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:02:50,767][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:02:52,149][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:02:52,152][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:02:52,153][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:02:54,067][__main__][INFO] - Iteration 189 took 54s (11.66% Gen, 84.80% Train). Generation: 6s, Training: 45s. Estimated remaining time: 12h 5m 0s. Estimated total time: 15h 1m 49s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 10s, 500 more iterations: 7h 30m 54s. [2026-03-25 17:02:54,070][__main__][INFO] - Starting iteration 189. [2026-03-25 17:02:54,074][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:02:54,074][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:03:00,471][__main__][INFO] - Number of regex retries in iteration 189: 0 [2026-03-25 17:03:00,473][__main__][INFO] - agents played in iteration 189 are Bob, Alice [2026-03-25 17:03:00,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:03:00,994][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:03:00,995][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:03:00,995][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:03:01,693][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:03:02,298][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:03:02,957][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:03:03,613][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:03:04,272][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:03:04,929][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:03:05,586][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:03:06,247][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:03:06,906][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:03:07,563][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:03:08,220][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:03:08,877][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:03:09,534][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:03:10,192][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:03:10,851][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:03:11,507][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:03:12,164][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:03:12,823][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:03:13,481][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:03:14,139][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:03:14,796][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:03:15,453][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:03:16,111][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:03:16,768][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:03:17,427][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:03:18,085][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:03:18,742][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:03:19,399][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:03:20,056][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:03:20,713][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:03:21,370][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:03:22,028][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:03:22,685][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:03:23,343][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:03:24,000][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:03:24,658][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:03:25,315][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:03:25,974][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:03:26,633][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:03:27,291][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:03:27,948][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:03:28,607][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:03:29,265][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:03:29,922][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:03:30,580][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:03:31,237][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:03:31,894][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:03:32,552][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:03:33,534][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:03:34,192][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:03:34,849][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:03:35,507][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:03:36,163][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:03:36,821][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:03:37,479][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:03:38,138][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:03:38,795][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:03:39,455][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:03:40,112][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:03:40,769][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:03:41,427][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:03:42,084][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:03:42,741][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:03:43,398][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:03:44,055][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:03:44,833][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:03:46,159][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:03:46,162][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:03:46,163][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:03:47,759][__main__][INFO] - Iteration 190 took 53s (11.92% Gen, 85.10% Train). Generation: 6s, Training: 45s. Estimated remaining time: 11h 57m 5s. Estimated total time: 14h 54m 47s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 28s, 500 more iterations: 7h 27m 23s. [2026-03-25 17:03:47,761][__main__][INFO] - Starting iteration 190. [2026-03-25 17:03:47,766][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:03:47,767][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:03:50,794][mllm.models.large_language_model_local][WARNING] - Response =A= did not match regex: (|), retry 1/1 [2026-03-25 17:03:54,035][__main__][INFO] - Number of regex retries in iteration 190: 1 [2026-03-25 17:03:54,036][__main__][INFO] - agents played in iteration 190 are Bob, Alice [2026-03-25 17:03:54,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:03:54,679][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:03:54,679][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:03:54,680][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2026-03-25 17:03:55,374][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:03:56,012][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:03:56,662][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:03:57,319][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:03:57,977][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:03:58,636][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:03:59,293][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:03:59,950][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:04:00,608][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:04:01,265][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:04:01,922][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:04:02,579][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:04:03,236][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:04:03,893][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:04:04,552][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:04:05,209][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:04:05,867][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:04:06,524][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:04:07,182][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:04:07,838][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:04:08,496][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 256 [2026-03-25 17:04:09,154][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:04:09,811][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:04:10,468][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:04:11,126][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:04:11,783][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:04:12,441][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:04:13,099][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:04:13,757][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:04:14,413][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:04:15,070][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:04:15,728][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:04:16,385][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:04:17,043][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:04:17,701][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:04:18,359][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:04:19,016][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:04:19,674][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:04:20,333][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:04:20,990][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:04:21,648][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 
17:04:22,305][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:04:22,962][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:04:23,619][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:04:24,276][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:04:24,932][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:04:25,590][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:04:26,247][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:04:27,231][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:04:27,888][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:04:28,548][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:04:29,206][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:04:29,863][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:04:30,521][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:04:31,178][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:04:31,835][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:04:32,492][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:04:33,151][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:04:33,808][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:04:34,465][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:04:35,122][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:04:35,779][mllm.training.trainer_common][INFO] - 
Processing mini-batch 244 of 256 [2026-03-25 17:04:36,442][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:04:37,101][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:04:37,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:04:38,726][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:04:40,058][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:04:40,060][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:04:40,061][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:04:41,994][__main__][INFO] - Iteration 191 took 54s (11.56% Gen, 84.87% Train). Generation: 6s, Training: 46s. Estimated remaining time: 12h 5m 13s. Estimated total time: 15h 3m 50s. Time estimates for 10 more iterations: 9m 2s, 100 more iterations: 1h 30m 23s, 500 more iterations: 7h 31m 55s. [2026-03-25 17:04:41,996][__main__][INFO] - Starting iteration 191. [2026-03-25 17:04:42,000][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:04:42,001][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:04:46,870][__main__][INFO] - Number of regex retries in iteration 191: 0 [2026-03-25 17:04:46,871][__main__][INFO] - agents played in iteration 191 are Bob, Alice [2026-03-25 17:04:47,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:04:47,514][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:04:47,514][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:04:47,515][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:04:48,217][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:04:48,841][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:04:49,500][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:04:50,156][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:04:50,814][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:04:51,473][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:04:52,129][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:04:52,787][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:04:53,447][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:04:54,103][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:04:54,763][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:04:55,421][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:04:56,080][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:04:56,738][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:04:57,396][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:04:58,056][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:04:58,714][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:04:59,371][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:05:00,029][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:05:00,687][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:05:01,345][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:05:02,003][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:05:02,661][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:05:03,319][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:05:03,979][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:05:04,637][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:05:05,294][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:05:05,952][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:05:06,609][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:05:07,266][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:05:07,923][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:05:08,579][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:05:09,237][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:05:09,895][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:05:10,553][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:05:11,211][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:05:11,869][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:05:12,526][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:05:13,184][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:05:13,841][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:05:14,498][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:05:15,155][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:05:15,813][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:05:16,470][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:05:17,129][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:05:17,786][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:05:18,444][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:05:19,101][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:05:20,114][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:05:20,773][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:05:21,431][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:05:22,089][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:05:22,746][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:05:23,403][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:05:24,061][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:05:24,718][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:05:25,378][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:05:26,035][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:05:26,692][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:05:27,350][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:05:28,007][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:05:28,665][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:05:29,322][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:05:29,980][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:05:30,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:05:31,535][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:05:32,855][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:05:32,858][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:05:32,859][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:05:34,517][__main__][INFO] - Iteration 192 took 52s (9.27% Gen, 87.57% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 35m 49s. Estimated total time: 14h 35m 18s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 31s, 500 more iterations: 7h 17m 39s. [2026-03-25 17:05:34,519][__main__][INFO] - Starting iteration 192. [2026-03-25 17:05:34,523][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:05:34,523][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:05:40,171][__main__][INFO] - Number of regex retries in iteration 192: 0 [2026-03-25 17:05:40,172][__main__][INFO] - agents played in iteration 192 are Bob, Alice [2026-03-25 17:05:41,238][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:05:41,299][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:05:41,300][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:05:41,300][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:05:42,121][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:05:42,750][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:05:43,414][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:05:44,072][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:05:44,729][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:05:45,387][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:05:46,045][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:05:46,704][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:05:47,362][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:05:48,021][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:05:48,680][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:05:49,337][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:05:49,997][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:05:50,654][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:05:51,311][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:05:51,971][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:05:52,625][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:05:53,282][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:05:53,941][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:05:54,598][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:05:55,255][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:05:55,912][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:05:56,571][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:05:57,228][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:05:57,886][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:05:58,543][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:05:59,202][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:05:59,860][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:06:00,518][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:06:01,175][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:06:01,833][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:06:02,491][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:06:03,148][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:06:03,806][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:06:04,464][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:06:05,122][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:06:05,779][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:06:06,437][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:06:07,095][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:06:07,752][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:06:08,409][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:06:09,067][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:06:09,725][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:06:10,383][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:06:11,041][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:06:11,698][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:06:12,356][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:06:13,014][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:06:14,008][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:06:14,666][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:06:15,325][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:06:15,982][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:06:16,640][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:06:17,298][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:06:17,955][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:06:18,613][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:06:19,271][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:06:19,929][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:06:20,588][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:06:21,247][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:06:21,905][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:06:22,563][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:06:23,220][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:06:23,877][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:06:24,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:06:25,336][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:06:26,708][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:06:26,710][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:06:26,711][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:06:28,286][__main__][INFO] - Iteration 193 took 53s (10.51% Gen, 86.56% Train). Generation: 5s, Training: 46s. Estimated remaining time: 11h 55m 42s. Estimated total time: 14h 56m 4s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 36s, 500 more iterations: 7h 28m 2s. [2026-03-25 17:06:28,288][__main__][INFO] - Starting iteration 193. [2026-03-25 17:06:28,292][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:06:28,293][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:06:34,113][__main__][INFO] - Number of regex retries in iteration 193: 0 [2026-03-25 17:06:34,115][__main__][INFO] - agents played in iteration 193 are Bob, Alice [2026-03-25 17:06:34,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:06:35,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:06:35,054][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:06:35,054][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:06:35,703][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:06:36,308][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:06:36,968][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:06:37,626][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:06:38,284][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:06:38,943][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:06:39,601][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:06:40,259][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:06:40,917][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:06:41,575][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:06:42,232][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:06:42,889][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:06:43,547][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:06:44,203][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:06:44,861][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:06:45,518][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:06:46,180][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:06:46,839][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:06:47,497][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:06:48,154][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:06:48,812][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:06:49,469][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:06:50,127][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:06:50,784][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:06:51,443][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:06:52,101][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:06:52,759][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:06:53,418][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:06:54,075][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:06:54,733][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:06:55,391][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:06:56,048][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:06:56,706][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:06:57,365][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:06:58,024][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:06:58,684][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:06:59,342][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:07:00,000][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:07:00,658][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:07:01,315][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:07:01,972][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:07:02,630][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:07:03,288][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:07:03,944][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:07:04,602][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:07:05,259][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:07:05,917][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:07:06,575][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:07:07,554][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:07:08,212][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:07:08,870][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:07:09,529][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:07:10,187][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:07:10,844][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:07:11,503][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:07:12,159][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:07:12,817][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:07:13,474][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:07:14,132][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:07:14,790][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:07:15,447][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:07:16,105][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:07:16,763][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:07:17,421][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:07:18,083][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:07:18,893][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:07:20,590][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:07:20,593][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:07:20,594][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:07:22,063][__main__][INFO] - Iteration 194 took 53s (10.83% Gen, 86.44% Train). Generation: 5s, Training: 46s. Estimated remaining time: 11h 54m 55s. Estimated total time: 14h 56m 12s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 37s, 500 more iterations: 7h 28m 6s. [2026-03-25 17:07:22,066][__main__][INFO] - Starting iteration 194. [2026-03-25 17:07:22,071][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:07:22,071][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:07:28,408][__main__][INFO] - Number of regex retries in iteration 194: 0 [2026-03-25 17:07:28,410][__main__][INFO] - agents played in iteration 194 are Bob, Alice [2026-03-25 17:07:29,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:07:29,531][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:07:29,532][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:07:29,532][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:07:30,399][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:07:31,032][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:07:31,668][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:07:32,325][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:07:32,983][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:07:33,640][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:07:34,298][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:07:34,958][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:07:35,614][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:07:36,273][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:07:36,931][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:07:37,589][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:07:38,246][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:07:38,905][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:07:39,563][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:07:40,220][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:07:40,878][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:07:41,536][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:07:42,192][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:07:42,850][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:07:43,508][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:07:44,165][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:07:44,822][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:07:45,480][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:07:46,137][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:07:46,794][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:07:47,453][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:07:48,110][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:07:48,768][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:07:49,425][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:07:50,082][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:07:50,741][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:07:51,396][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:07:52,052][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:07:52,709][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:07:53,368][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:07:54,025][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:07:54,682][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:07:55,339][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:07:55,999][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:07:56,655][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:07:57,311][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:07:57,969][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:07:58,628][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:07:59,285][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:07:59,942][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:08:00,599][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:08:01,256][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:08:02,268][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:08:02,925][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:08:03,583][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:08:04,241][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:08:04,898][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:08:05,555][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:08:06,213][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:08:06,870][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:08:07,528][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:08:08,185][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:08:08,842][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:08:09,500][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:08:10,160][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:08:10,815][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:08:11,473][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:08:12,133][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:08:12,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:08:13,573][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:08:15,483][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:08:15,486][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:08:15,487][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:08:16,992][__main__][INFO] - Iteration 195 took 54s (11.54% Gen, 85.71% Train). Generation: 6s, Training: 47s. Estimated remaining time: 12h 13m 11s. Estimated total time: 15h 15m 23s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 32s, 500 more iterations: 7h 37m 41s. [2026-03-25 17:08:16,994][__main__][INFO] - Starting iteration 195. [2026-03-25 17:08:16,999][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:08:16,999][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:08:25,346][__main__][INFO] - Number of regex retries in iteration 195: 0 [2026-03-25 17:08:25,347][__main__][INFO] - agents played in iteration 195 are Bob, Alice [2026-03-25 17:08:25,926][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:08:25,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:08:25,989][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:08:25,989][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:08:26,828][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:08:27,452][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:08:28,114][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:08:28,772][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:08:29,429][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:08:30,086][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:08:30,743][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:08:31,401][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:08:32,058][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:08:32,716][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:08:33,373][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:08:34,030][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:08:34,687][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:08:35,345][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:08:36,002][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:08:36,661][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:08:37,319][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:08:37,975][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:08:38,631][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:08:39,288][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:08:39,946][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:08:40,602][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:08:41,259][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:08:41,916][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:08:42,573][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:08:43,231][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:08:43,889][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:08:44,546][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:08:45,203][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:08:45,861][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:08:46,519][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:08:47,175][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:08:47,832][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:08:48,490][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:08:49,148][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:08:49,807][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:08:50,464][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:08:51,121][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:08:51,778][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:08:52,435][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:08:53,092][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:08:53,749][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:08:54,406][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:08:55,063][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:08:55,720][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:08:56,377][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:08:57,033][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:08:57,691][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:08:58,670][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:08:59,327][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:08:59,990][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:09:00,645][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:09:01,303][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:09:01,959][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:09:02,619][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:09:03,274][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:09:03,935][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:09:04,592][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:09:05,250][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:09:05,907][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:09:06,564][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:09:07,221][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:09:07,879][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:09:08,536][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:09:09,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:09:10,112][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:09:11,663][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:09:11,666][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:09:11,667][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:09:13,155][__main__][INFO] - Iteration 196 took 56s (14.86% Gen, 82.48% Train). Generation: 8s, Training: 46s. Estimated remaining time: 12h 32m 50s. Estimated total time: 15h 35m 58s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 35s, 500 more iterations: 7h 47m 59s. [2026-03-25 17:09:13,157][__main__][INFO] - Starting iteration 196. [2026-03-25 17:09:13,162][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:09:13,162][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:09:18,154][__main__][INFO] - Number of regex retries in iteration 196: 0 [2026-03-25 17:09:18,156][__main__][INFO] - agents played in iteration 196 are Bob, Alice [2026-03-25 17:09:18,626][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:09:18,686][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:09:18,687][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:09:18,688][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:09:19,416][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:09:20,035][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:09:20,685][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:09:21,344][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:09:21,997][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:09:22,654][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:09:23,311][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:09:23,970][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:09:24,625][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:09:25,285][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:09:25,942][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:09:26,599][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:09:27,256][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:09:27,913][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:09:28,571][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:09:29,229][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:09:29,885][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:09:30,544][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:09:31,201][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:09:31,858][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:09:32,514][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:09:33,171][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:09:33,828][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:09:34,485][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:09:35,142][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:09:35,801][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:09:36,458][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:09:37,117][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:09:37,772][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:09:38,429][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:09:39,086][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:09:39,743][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:09:40,403][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:09:41,062][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:09:41,719][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:09:42,376][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:09:43,034][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:09:43,691][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:09:44,348][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:09:45,005][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:09:45,662][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:09:46,320][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:09:46,977][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:09:47,634][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:09:48,292][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:09:48,950][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:09:49,607][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:09:50,265][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:09:51,244][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:09:51,904][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:09:52,559][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:09:53,216][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:09:53,873][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:09:54,531][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:09:55,188][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:09:55,845][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:09:56,502][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:09:57,159][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:09:57,816][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:09:58,476][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:09:59,133][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:09:59,791][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:10:00,448][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:10:01,106][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:10:01,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:10:02,706][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:10:04,049][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:10:04,052][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:10:04,053][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:10:05,479][__main__][INFO] - Iteration 197 took 52s (9.54% Gen, 87.73% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 27m 59s. Estimated total time: 14h 31m 59s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 11s, 500 more iterations: 7h 15m 59s. [2026-03-25 17:10:05,481][__main__][INFO] - Starting iteration 197. [2026-03-25 17:10:05,485][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:10:05,486][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:10:11,525][__main__][INFO] - Number of regex retries in iteration 197: 0 [2026-03-25 17:10:11,526][__main__][INFO] - agents played in iteration 197 are Bob, Alice [2026-03-25 17:10:12,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:10:12,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:10:12,169][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:10:12,169][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:10:12,926][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:10:13,546][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:10:14,204][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:10:14,860][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:10:15,516][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:10:16,172][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:10:16,831][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:10:17,488][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:10:18,144][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:10:18,801][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:10:19,458][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:10:20,114][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:10:20,771][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:10:21,428][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:10:22,085][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:10:22,741][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:10:23,398][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:10:24,056][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:10:24,712][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:10:25,370][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:10:26,027][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:10:26,684][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:10:27,341][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:10:28,000][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:10:28,658][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:10:29,317][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:10:29,974][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:10:30,631][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:10:31,288][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:10:31,946][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:10:32,603][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:10:33,260][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:10:33,918][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:10:34,576][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:10:35,233][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:10:35,891][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:10:36,548][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:10:37,205][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:10:37,862][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:10:38,519][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:10:39,176][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:10:39,837][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:10:40,492][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:10:41,148][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:10:41,805][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:10:42,464][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:10:43,121][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:10:43,779][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:10:44,766][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:10:45,424][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:10:46,081][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:10:46,738][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:10:47,395][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:10:48,052][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:10:48,710][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:10:49,368][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:10:50,024][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:10:50,681][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:10:51,338][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:10:51,999][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:10:52,654][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:10:53,314][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:10:53,970][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:10:54,627][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:10:55,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:10:56,089][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:10:57,442][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:10:57,445][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:10:57,446][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:10:59,091][__main__][INFO] - Iteration 198 took 53s (11.27% Gen, 85.66% Train). Generation: 6s, Training: 45s. Estimated remaining time: 11h 48m 34s. Estimated total time: 14h 53m 28s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 20s, 500 more iterations: 7h 26m 44s. [2026-03-25 17:10:59,094][__main__][INFO] - Starting iteration 198. [2026-03-25 17:10:59,098][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:10:59,099][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:11:06,697][__main__][INFO] - Number of regex retries in iteration 198: 0 [2026-03-25 17:11:06,698][__main__][INFO] - agents played in iteration 198 are Bob, Alice [2026-03-25 17:11:07,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:11:07,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:11:07,830][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:11:07,830][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:11:08,559][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:11:09,165][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:11:09,823][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:11:10,483][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:11:11,142][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:11:11,799][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:11:12,456][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:11:13,114][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:11:13,771][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:11:14,428][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:11:15,086][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:11:15,743][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:11:16,400][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:11:17,057][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:11:17,714][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:11:18,371][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:11:19,027][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:11:19,685][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:11:20,342][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:11:20,998][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:11:21,655][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:11:22,312][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:11:22,969][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:11:23,626][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:11:24,283][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:11:24,940][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:11:25,597][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:11:26,253][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:11:26,911][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:11:27,566][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:11:28,223][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:11:28,881][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:11:29,539][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:11:30,196][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:11:30,853][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:11:31,510][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:11:32,166][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:11:32,824][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:11:33,481][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:11:34,138][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:11:34,795][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:11:35,451][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:11:36,109][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:11:36,766][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:11:37,423][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:11:38,083][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:11:38,739][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:11:39,396][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:11:40,367][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:11:41,025][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:11:41,682][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:11:42,339][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:11:42,996][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:11:43,653][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:11:44,310][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:11:44,966][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:11:45,623][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:11:46,280][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:11:46,937][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:11:47,594][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:11:48,253][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:11:48,910][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:11:49,568][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:11:50,225][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:11:50,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:11:51,794][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:11:53,130][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:11:53,133][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:11:53,134][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:11:54,822][__main__][INFO] - Iteration 199 took 55s (13.64% Gen, 83.33% Train). Generation: 7s, Training: 46s. Estimated remaining time: 12h 22m 56s. Estimated total time: 15h 28m 45s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 52s, 500 more iterations: 7h 44m 22s. [2026-03-25 17:11:54,824][__main__][INFO] - Starting iteration 199. [2026-03-25 17:11:54,829][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:11:54,829][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:12:00,649][__main__][INFO] - Number of regex retries in iteration 199: 0 [2026-03-25 17:12:00,651][__main__][INFO] - agents played in iteration 199 are Bob, Alice [2026-03-25 17:12:01,494][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:12:01,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:12:01,556][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:12:01,556][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:12:02,275][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:12:02,900][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:12:03,558][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:12:04,214][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:12:04,871][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:12:05,526][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:12:06,182][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:12:06,838][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:12:07,495][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:12:08,152][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:12:08,809][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:12:09,466][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:12:10,122][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:12:10,779][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:12:11,436][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:12:12,095][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:12:12,752][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:12:13,410][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:12:14,067][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:12:14,723][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:12:15,380][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:12:16,037][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:12:16,695][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:12:17,352][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:12:18,009][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:12:18,666][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:12:19,325][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:12:19,980][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:12:20,637][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:12:21,295][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:12:21,952][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:12:22,609][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:12:23,266][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:12:23,924][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:12:24,581][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:12:25,239][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:12:25,896][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:12:26,553][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:12:27,210][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:12:27,869][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:12:28,527][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:12:29,184][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:12:29,841][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:12:30,498][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:12:31,156][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:12:31,814][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:12:32,471][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:12:33,128][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:12:34,100][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:12:34,757][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:12:35,414][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:12:36,071][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:12:36,728][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:12:37,386][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:12:38,044][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:12:38,701][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:12:39,358][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:12:40,017][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:12:40,674][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:12:41,332][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:12:41,990][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:12:42,647][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:12:43,304][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:12:43,961][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:12:44,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:12:45,386][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:12:46,868][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:12:46,870][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:12:46,871][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:12:48,510][__main__][INFO] - Iteration 200 took 53s (10.84% Gen, 86.10% Train). Generation: 5s, Training: 46s. Estimated remaining time: 11h 47m 59s. Estimated total time: 14h 54m 42s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 28s, 500 more iterations: 7h 27m 21s. [2026-03-25 17:12:48,512][__main__][INFO] - Starting iteration 200. [2026-03-25 17:12:48,517][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:12:48,518][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:12:53,816][__main__][INFO] - Number of regex retries in iteration 200: 0 [2026-03-25 17:12:53,817][__main__][INFO] - agents played in iteration 200 are Bob, Alice [2026-03-25 17:12:54,392][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:12:54,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:12:54,454][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:12:54,455][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:12:55,350][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:12:55,979][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:12:56,640][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:12:57,298][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:12:57,956][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:12:58,615][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:12:59,272][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:12:59,929][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:13:00,586][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:13:01,242][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:13:01,900][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:13:02,556][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:13:03,213][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:13:03,872][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:13:04,528][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:13:05,186][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:13:05,845][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:13:06,502][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:13:07,159][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:13:07,819][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:13:08,480][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:13:09,138][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:13:09,795][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:13:10,453][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:13:11,109][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:13:11,766][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:13:12,423][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:13:13,081][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:13:13,739][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:13:14,396][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:13:15,053][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:13:15,709][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:13:16,367][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:13:17,025][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:13:17,681][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:13:18,339][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:13:18,996][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:13:19,654][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:13:20,312][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:13:20,970][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:13:21,627][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:13:22,284][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:13:22,942][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:13:23,599][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:13:24,257][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:13:24,915][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:13:25,571][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:13:26,229][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:13:27,220][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:13:27,877][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:13:28,535][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:13:29,194][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:13:29,851][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:13:30,509][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:13:31,166][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:13:31,823][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:13:32,480][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:13:33,138][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:13:33,795][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:13:34,452][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:13:35,109][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:13:35,766][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:13:36,423][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:13:37,081][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:13:37,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:13:38,548][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:13:39,908][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:13:39,911][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:13:39,912][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:13:43,256][__main__][INFO] - Iteration 201 took 54s (9.68% Gen, 84.20% Train). Generation: 5s, Training: 46s. Estimated remaining time: 12h 4m 43s. Estimated total time: 15h 12m 21s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 14s, 500 more iterations: 7h 36m 10s. [2026-03-25 17:13:43,259][__main__][INFO] - Starting iteration 201. [2026-03-25 17:13:43,263][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:13:43,263][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:13:50,608][__main__][INFO] - Number of regex retries in iteration 201: 0 [2026-03-25 17:13:50,610][__main__][INFO] - agents played in iteration 201 are Bob, Alice [2026-03-25 17:13:51,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:13:51,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:13:51,210][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:13:51,210][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:13:52,072][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:13:52,688][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:13:53,347][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:13:54,002][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:13:54,660][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:13:55,316][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:13:55,978][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:13:56,637][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:13:57,294][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:13:57,952][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:13:58,609][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:13:59,269][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:13:59,933][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:14:00,589][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:14:01,246][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:14:01,903][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:14:02,560][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:14:03,218][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:14:03,876][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:14:04,533][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:14:05,192][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:14:05,849][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:14:06,506][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:14:07,163][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:14:07,820][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:14:08,478][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:14:09,135][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:14:09,793][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:14:10,451][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:14:11,108][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:14:11,766][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:14:12,422][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:14:13,080][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:14:13,737][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:14:14,394][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:14:15,051][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:14:15,708][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:14:16,366][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:14:17,024][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:14:17,681][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:14:18,338][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:14:18,995][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:14:19,652][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:14:20,309][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:14:20,966][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:14:21,623][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:14:22,281][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:14:22,938][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:14:23,940][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:14:24,599][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:14:25,259][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:14:25,917][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:14:26,573][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:14:27,232][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:14:27,890][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:14:28,548][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:14:29,206][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:14:29,863][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:14:30,520][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:14:31,177][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:14:31,834][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:14:32,491][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:14:33,149][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:14:33,806][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:14:34,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:14:35,252][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:14:36,917][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:14:36,920][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:14:36,921][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:14:38,395][__main__][INFO] - Iteration 202 took 55s (13.32% Gen, 84.00% Train). Generation: 7s, Training: 46s. Estimated remaining time: 12h 10m 21s. Estimated total time: 15h 18m 54s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 53s, 500 more iterations: 7h 39m 27s. [2026-03-25 17:14:38,397][__main__][INFO] - Starting iteration 202. [2026-03-25 17:14:38,403][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:14:38,404][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:14:45,002][__main__][INFO] - Number of regex retries in iteration 202: 0 [2026-03-25 17:14:45,003][__main__][INFO] - agents played in iteration 202 are Bob, Alice [2026-03-25 17:14:45,536][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:14:45,598][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:14:45,598][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:14:45,599][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:14:46,442][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:14:47,062][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:14:47,721][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:14:48,378][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:14:49,035][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:14:49,692][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:14:50,350][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:14:51,009][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:14:51,665][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:14:52,323][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:14:52,981][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:14:53,639][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:14:54,298][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:14:54,955][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:14:55,612][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:14:56,270][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:14:56,928][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:14:57,586][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:14:58,245][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:14:58,903][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:14:59,560][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:15:00,217][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:15:00,875][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:15:01,532][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:15:02,189][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:15:02,846][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:15:03,505][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:15:04,162][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:15:04,821][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:15:05,478][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:15:06,135][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:15:06,793][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:15:07,451][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:15:08,108][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:15:08,765][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:15:09,422][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:15:10,080][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:15:10,738][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:15:11,395][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:15:12,052][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:15:12,709][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:15:13,367][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:15:14,024][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:15:14,681][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:15:15,338][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:15:15,995][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:15:16,652][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:15:17,310][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:15:18,309][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:15:18,966][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:15:19,623][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:15:20,280][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:15:20,939][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:15:21,597][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:15:22,254][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:15:22,912][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:15:23,569][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:15:24,226][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:15:24,883][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:15:25,540][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:15:26,200][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:15:26,856][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:15:27,513][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:15:28,171][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:15:28,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:15:29,582][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:15:30,957][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:15:30,960][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:15:30,961][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:15:32,469][__main__][INFO] - Iteration 203 took 54s (12.20% Gen, 85.00% Train). Generation: 6s, Training: 45s. Estimated remaining time: 11h 51m 41s. Estimated total time: 15h 1m 8s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 6s, 500 more iterations: 7h 30m 34s. [2026-03-25 17:15:32,472][__main__][INFO] - Starting iteration 203. [2026-03-25 17:15:32,476][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:15:32,476][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:15:37,642][__main__][INFO] - Number of regex retries in iteration 203: 0 [2026-03-25 17:15:37,643][__main__][INFO] - agents played in iteration 203 are Bob, Alice [2026-03-25 17:15:38,528][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:15:38,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:15:38,590][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:15:38,590][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:15:39,336][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:15:39,950][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:15:40,609][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:15:41,266][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:15:41,925][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:15:42,583][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:15:43,241][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:15:43,899][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:15:44,556][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:15:45,214][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:15:45,873][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:15:46,529][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:15:47,188][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:15:47,845][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:15:48,503][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:15:49,161][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:15:49,819][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:15:50,476][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:15:51,134][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:15:51,792][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:15:52,451][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:15:53,110][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:15:53,770][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:15:54,427][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:15:55,084][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:15:55,742][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:15:56,399][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:15:57,056][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:15:57,713][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:15:58,371][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:15:59,029][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:15:59,687][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:16:00,344][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:16:01,002][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:16:01,659][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:16:02,317][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:16:02,975][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:16:03,632][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:16:04,289][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:16:04,946][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:16:05,603][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:16:06,260][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:16:06,917][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:16:07,574][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:16:08,231][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:16:08,889][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:16:09,546][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:16:10,205][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:16:11,196][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:16:11,854][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:16:12,511][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:16:13,169][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:16:13,827][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:16:14,485][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:16:15,142][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:16:15,800][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:16:16,460][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:16:17,117][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:16:17,775][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:16:18,434][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:16:19,092][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:16:19,750][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:16:20,412][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:16:21,070][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:16:21,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:16:22,597][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:16:24,010][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:16:24,012][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:16:24,014][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:16:25,374][__main__][INFO] - Iteration 204 took 52s (9.77% Gen, 87.66% Train). Generation: 5s, Training: 46s. Estimated remaining time: 11h 31m 20s. Estimated total time: 14h 41m 40s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 10s, 500 more iterations: 7h 20m 50s. [2026-03-25 17:16:25,376][__main__][INFO] - Starting iteration 204. [2026-03-25 17:16:25,382][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:16:25,382][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:16:37,216][__main__][INFO] - Number of regex retries in iteration 204: 0 [2026-03-25 17:16:37,217][__main__][INFO] - agents played in iteration 204 are Bob, Alice [2026-03-25 17:16:38,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:16:38,318][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:16:38,318][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:16:38,319][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:16:39,064][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:16:39,692][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:16:40,352][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:16:41,010][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:16:41,667][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:16:42,325][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:16:42,983][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:16:43,641][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:16:44,298][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:16:44,956][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:16:45,613][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:16:46,272][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:16:46,930][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:16:47,588][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:16:48,245][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:16:48,904][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:16:49,564][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:16:50,222][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:16:50,879][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:16:51,536][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:16:52,194][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:16:52,851][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:16:53,509][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:16:54,167][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:16:54,824][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:16:55,481][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:16:56,139][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:16:56,797][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:16:57,455][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:16:58,112][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:16:58,770][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:16:59,429][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:17:00,086][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:17:00,743][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:17:01,400][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:17:02,058][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:17:02,715][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:17:03,372][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:17:04,030][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:17:04,687][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:17:05,344][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:17:06,002][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:17:06,660][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:17:07,318][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:17:07,976][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:17:08,634][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:17:09,294][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:17:09,951][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:17:10,936][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:17:11,596][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:17:12,253][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:17:12,911][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:17:13,568][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:17:14,228][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:17:14,886][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:17:15,544][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:17:16,203][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:17:16,861][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:17:17,521][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:17:18,178][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:17:18,838][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:17:19,495][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:17:20,154][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:17:20,811][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:17:21,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:17:22,167][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:17:24,106][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:17:24,109][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:17:24,110][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:17:25,679][__main__][INFO] - Iteration 205 took 1m 0s (19.63% Gen, 77.77% Train). Generation: 11s, Training: 46s. Estimated remaining time: 13h 33m 39s. Estimated total time: 16h 44m 59s. Time estimates for 10 more iterations: 10m 2s, 100 more iterations: 1h 40m 29s, 500 more iterations: 8h 22m 29s. [2026-03-25 17:17:25,681][__main__][INFO] - Starting iteration 205. [2026-03-25 17:17:25,684][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:17:25,685][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:17:32,165][__main__][INFO] - Number of regex retries in iteration 205: 0 [2026-03-25 17:17:32,166][__main__][INFO] - agents played in iteration 205 are Bob, Alice [2026-03-25 17:17:32,631][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:17:32,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:17:32,692][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:17:32,693][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:17:33,444][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:17:34,056][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:17:34,710][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:17:35,367][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:17:36,024][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:17:36,681][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:17:37,341][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:17:37,999][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:17:38,658][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:17:39,317][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:17:39,975][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:17:40,633][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:17:41,292][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:17:41,949][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:17:42,606][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:17:43,264][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:17:43,922][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:17:44,580][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:17:45,238][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:17:45,897][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:17:46,555][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:17:47,212][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:17:47,870][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:17:48,527][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:17:49,186][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:17:49,843][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:17:50,501][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:17:51,158][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:17:51,818][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:17:52,476][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:17:53,133][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:17:53,791][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:17:54,448][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:17:55,106][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:17:55,763][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:17:56,421][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:17:57,078][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:17:57,736][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:17:58,395][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:17:59,053][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:17:59,712][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:18:00,370][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:18:01,028][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:18:01,685][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:18:02,343][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:18:03,001][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:18:03,658][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:18:04,316][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:18:05,309][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:18:05,968][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:18:06,630][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:18:07,287][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:18:07,945][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:18:08,603][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:18:09,262][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:18:09,920][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:18:10,579][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:18:11,238][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:18:11,896][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:18:12,555][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:18:13,213][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:18:13,870][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:18:14,528][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:18:15,186][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:18:15,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:18:16,749][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:18:18,732][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:18:18,735][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:18:18,736][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:18:20,128][__main__][INFO] - Iteration 206 took 54s (11.90% Gen, 85.54% Train). Generation: 6s, Training: 46s. Estimated remaining time: 11h 55m 10s. Estimated total time: 15h 7m 25s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 44s, 500 more iterations: 7h 33m 42s. [2026-03-25 17:18:20,131][__main__][INFO] - Starting iteration 206. [2026-03-25 17:18:20,135][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:18:20,135][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:18:25,953][__main__][INFO] - Number of regex retries in iteration 206: 0 [2026-03-25 17:18:25,954][__main__][INFO] - agents played in iteration 206 are Bob, Alice [2026-03-25 17:18:26,497][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:18:26,558][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:18:26,559][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:18:26,559][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:18:27,431][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:18:28,048][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:18:28,710][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:18:29,368][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:18:30,025][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:18:30,686][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:18:31,345][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:18:32,004][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:18:32,661][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:18:33,319][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:18:33,977][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:18:34,634][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:18:35,291][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:18:35,948][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:18:36,605][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:18:37,263][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:18:37,921][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:18:38,578][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:18:39,237][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:18:39,894][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:18:40,551][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:18:41,208][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:18:41,865][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:18:42,521][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:18:43,179][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:18:43,837][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:18:44,494][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:18:45,151][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:18:45,809][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:18:46,467][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:18:47,124][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:18:47,782][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:18:48,439][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:18:49,096][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:18:49,752][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:18:50,410][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:18:51,068][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:18:51,725][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:18:52,382][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:18:53,039][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:18:53,697][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:18:54,355][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:18:55,013][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:18:55,671][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:18:56,329][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:18:56,986][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:18:57,644][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:18:58,303][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:18:59,297][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:18:59,955][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:19:00,615][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:19:01,274][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:19:01,931][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:19:02,589][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:19:03,248][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:19:03,906][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:19:04,565][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:19:05,222][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:19:05,881][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:19:06,538][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:19:07,197][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:19:07,856][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:19:08,514][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:19:09,172][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:19:09,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:19:10,523][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:19:11,874][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:19:11,877][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:19:11,878][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:19:13,432][__main__][INFO] - Iteration 207 took 53s (10.92% Gen, 86.16% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 35m 11s. Estimated total time: 14h 48m 19s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 49s, 500 more iterations: 7h 24m 9s. [2026-03-25 17:19:13,435][__main__][INFO] - Starting iteration 207. [2026-03-25 17:19:13,458][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:19:13,459][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:19:20,080][__main__][INFO] - Number of regex retries in iteration 207: 0 [2026-03-25 17:19:20,082][__main__][INFO] - agents played in iteration 207 are Bob, Alice [2026-03-25 17:19:20,574][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:19:20,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:19:20,635][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:19:20,636][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:19:21,481][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:19:22,096][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:19:22,752][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:19:23,409][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:19:24,066][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:19:24,722][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:19:25,380][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:19:26,038][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:19:26,694][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:19:27,351][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:19:28,009][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:19:28,668][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:19:29,328][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:19:29,985][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:19:30,642][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:19:31,299][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:19:31,957][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:19:32,617][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:19:33,275][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:19:33,934][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:19:34,591][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:19:35,248][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:19:35,907][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:19:36,564][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:19:37,222][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:19:37,880][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:19:38,537][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:19:39,195][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:19:39,854][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:19:40,511][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:19:41,168][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:19:41,826][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:19:42,483][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:19:43,140][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:19:43,798][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:19:44,455][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:19:45,112][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:19:45,770][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:19:46,427][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:19:47,084][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:19:47,741][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:19:48,399][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:19:49,057][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:19:49,715][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:19:50,374][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:19:51,031][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:19:51,688][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:19:52,345][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:19:53,343][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:19:54,002][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:19:54,662][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:19:55,318][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:19:55,978][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:19:56,636][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:19:57,295][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:19:57,952][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:19:58,612][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:19:59,270][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:19:59,926][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:20:00,586][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:20:01,243][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:20:01,900][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:20:02,557][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:20:03,216][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:20:03,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:20:04,609][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:20:06,566][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:20:06,569][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:20:06,570][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:20:08,034][__main__][INFO] - Iteration 208 took 54s (12.14% Gen, 85.18% Train). Generation: 6s, Training: 46s. Estimated remaining time: 11h 55m 34s. Estimated total time: 15h 9m 37s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 57s, 500 more iterations: 7h 34m 48s. [2026-03-25 17:20:08,037][__main__][INFO] - Starting iteration 208. [2026-03-25 17:20:08,040][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:20:08,040][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:20:13,809][__main__][INFO] - Number of regex retries in iteration 208: 0 [2026-03-25 17:20:13,810][__main__][INFO] - agents played in iteration 208 are Bob, Alice [2026-03-25 17:20:14,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:20:14,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:20:14,440][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:20:14,441][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:20:15,260][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:20:15,863][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:20:16,521][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:20:17,178][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:20:17,835][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:20:18,492][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:20:19,149][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:20:19,806][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:20:20,463][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:20:21,121][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:20:21,780][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:20:22,437][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:20:23,094][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:20:23,752][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:20:24,410][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:20:25,068][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:20:25,725][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:20:26,383][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:20:27,040][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:20:27,698][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:20:28,356][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:20:29,013][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:20:29,672][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:20:30,329][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:20:30,986][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:20:31,643][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:20:32,300][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:20:32,958][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:20:33,616][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:20:34,274][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:20:34,931][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:20:35,589][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:20:36,247][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:20:36,905][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:20:37,562][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:20:38,219][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:20:38,876][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:20:39,533][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:20:40,190][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:20:40,848][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:20:41,506][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:20:42,163][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:20:42,820][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:20:43,478][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:20:44,135][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:20:44,792][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:20:45,450][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:20:46,108][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:20:47,102][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:20:47,760][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:20:48,418][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:20:49,077][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:20:49,735][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:20:50,392][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:20:51,049][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:20:51,708][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:20:52,366][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:20:53,024][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:20:53,682][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:20:54,341][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:20:54,998][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:20:55,656][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:20:56,313][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:20:56,970][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:20:57,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:20:58,456][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:20:59,922][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:20:59,924][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:20:59,926][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:21:01,480][__main__][INFO] - Iteration 209 took 53s (10.80% Gen, 86.34% Train). Generation: 5s, Training: 46s. Estimated remaining time: 11h 35m 45s. Estimated total time: 14h 50m 41s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 4s, 500 more iterations: 7h 25m 20s. [2026-03-25 17:21:01,482][__main__][INFO] - Starting iteration 209. [2026-03-25 17:21:01,486][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:21:01,486][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:21:08,248][__main__][INFO] - Number of regex retries in iteration 209: 0 [2026-03-25 17:21:08,250][__main__][INFO] - agents played in iteration 209 are Bob, Alice [2026-03-25 17:21:09,062][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:21:09,124][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:21:09,124][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:21:09,125][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:21:09,879][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:21:10,497][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:21:11,157][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:21:11,813][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:21:12,471][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:21:13,128][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:21:13,787][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:21:14,444][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:21:15,101][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:21:15,759][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:21:16,417][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:21:17,074][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:21:17,732][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:21:18,389][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:21:19,047][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:21:19,705][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:21:20,364][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:21:21,022][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:21:21,679][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:21:22,336][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:21:22,993][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:21:23,650][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:21:24,310][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:21:24,969][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:21:25,628][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:21:26,286][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:21:26,945][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:21:27,603][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:21:28,261][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:21:28,920][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:21:29,577][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:21:30,234][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:21:30,891][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:21:31,549][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:21:32,206][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:21:32,863][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:21:33,521][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:21:34,178][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:21:34,836][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:21:35,495][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:21:36,152][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:21:36,809][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:21:37,466][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:21:38,123][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:21:38,780][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:21:39,437][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:21:40,094][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:21:40,752][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:21:41,743][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:21:42,402][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:21:43,061][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:21:43,718][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:21:44,378][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:21:45,034][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:21:45,694][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:21:46,352][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:21:47,010][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:21:47,668][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:21:48,326][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:21:48,983][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:21:49,642][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:21:50,300][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:21:50,957][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:21:51,614][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:21:52,271][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:21:53,053][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:21:54,515][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:21:54,518][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:21:54,519][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:21:55,869][__main__][INFO] - Iteration 210 took 54s (12.44% Gen, 85.08% Train). Generation: 6s, Training: 46s. Estimated remaining time: 11h 50m 35s. Estimated total time: 15h 6m 25s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 38s, 500 more iterations: 7h 33m 12s. [2026-03-25 17:21:55,871][__main__][INFO] - Starting iteration 210. [2026-03-25 17:21:55,875][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:21:55,876][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:22:01,890][__main__][INFO] - Number of regex retries in iteration 210: 0 [2026-03-25 17:22:01,892][__main__][INFO] - agents played in iteration 210 are Bob, Alice [2026-03-25 17:22:02,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:22:03,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:22:03,028][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:22:03,029][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:22:03,732][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:22:04,360][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:22:05,015][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:22:05,672][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:22:06,331][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:22:06,988][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:22:07,650][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:22:08,308][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:22:08,967][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:22:09,662][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:22:10,319][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:22:10,977][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:22:11,636][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:22:12,292][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:22:12,951][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:22:13,609][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:22:14,268][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:22:14,925][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:22:15,583][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:22:16,241][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:22:16,900][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:22:17,557][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:22:18,214][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:22:18,871][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:22:19,528][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:22:20,185][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:22:20,844][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:22:21,502][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:22:22,160][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:22:22,817][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:22:23,474][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:22:24,132][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:22:24,789][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:22:25,447][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:22:26,104][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:22:26,761][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:22:27,418][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:22:28,082][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:22:28,739][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:22:29,396][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:22:30,054][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:22:30,712][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:22:31,370][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:22:32,027][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:22:32,684][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:22:33,343][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:22:34,000][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:22:34,657][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:22:35,656][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:22:36,314][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:22:36,972][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:22:37,632][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:22:38,290][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:22:38,949][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:22:39,608][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:22:40,267][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:22:40,924][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:22:41,583][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:22:42,242][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:22:42,898][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:22:43,558][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:22:44,215][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:22:44,874][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:22:45,531][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:22:46,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:22:46,933][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:22:48,447][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:22:48,450][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:22:48,451][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:22:49,876][__main__][INFO] - Iteration 211 took 54s (11.14% Gen, 86.22% Train). Generation: 6s, Training: 46s. Estimated remaining time: 11h 43m 17s. Estimated total time: 15h 0m 1s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 0s, 500 more iterations: 7h 30m 0s. [2026-03-25 17:22:49,879][__main__][INFO] - Starting iteration 211. [2026-03-25 17:22:49,894][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:22:49,895][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:22:55,468][__main__][INFO] - Number of regex retries in iteration 211: 0 [2026-03-25 17:22:55,469][__main__][INFO] - agents played in iteration 211 are Bob, Alice [2026-03-25 17:22:55,935][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:22:55,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:22:55,996][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:22:55,996][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:22:56,772][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:22:57,400][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:22:58,058][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:22:58,718][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:22:59,375][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:23:00,032][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:23:00,694][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:23:01,353][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:23:02,010][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:23:02,667][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:23:03,326][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:23:03,984][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:23:04,641][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:23:05,299][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:23:05,956][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:23:06,613][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:23:07,271][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:23:07,929][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:23:08,586][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:23:09,245][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:23:09,902][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:23:10,560][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:23:11,218][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:23:11,876][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:23:12,533][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:23:13,190][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:23:13,849][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:23:14,506][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:23:15,163][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:23:15,820][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:23:16,477][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:23:17,135][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:23:17,797][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:23:18,454][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:23:19,111][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:23:19,768][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:23:20,425][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:23:21,082][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:23:21,741][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:23:22,398][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:23:23,056][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:23:23,712][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:23:24,370][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:23:25,027][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:23:25,684][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:23:26,341][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:23:27,003][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:23:27,659][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:23:28,667][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:23:29,327][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:23:29,984][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:23:30,641][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:23:31,300][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:23:31,957][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:23:32,614][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:23:33,273][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:23:33,931][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:23:34,590][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:23:35,248][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:23:35,907][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:23:36,564][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:23:37,222][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:23:37,881][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:23:38,538][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:23:39,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:23:39,968][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:23:41,424][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:23:41,427][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:23:41,428][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:23:42,833][__main__][INFO] - Iteration 212 took 52s (10.53% Gen, 86.81% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 24m 43s. Estimated total time: 14h 42m 20s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 14s, 500 more iterations: 7h 21m 10s. [2026-03-25 17:23:42,836][__main__][INFO] - Starting iteration 212. [2026-03-25 17:23:42,840][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:23:42,840][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:23:48,885][__main__][INFO] - Number of regex retries in iteration 212: 0 [2026-03-25 17:23:48,886][__main__][INFO] - agents played in iteration 212 are Bob, Alice [2026-03-25 17:23:49,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:23:49,528][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:23:49,529][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:23:49,529][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:23:50,291][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:23:50,912][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:23:51,570][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:23:52,231][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:23:52,889][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:23:53,549][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:23:54,206][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:23:54,864][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:23:55,521][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:23:56,178][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:23:56,836][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:23:57,492][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:23:58,208][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:23:58,876][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:23:59,534][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:24:00,190][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:24:00,848][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:24:01,505][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:24:02,162][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:24:02,821][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:24:03,478][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:24:04,134][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:24:04,791][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:24:05,449][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:24:06,106][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:24:06,764][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:24:07,422][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:24:08,079][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:24:08,736][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:24:09,393][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:24:10,050][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:24:10,707][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:24:11,364][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:24:12,021][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:24:12,678][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:24:13,336][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:24:13,994][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:24:14,652][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:24:15,310][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:24:15,967][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:24:16,626][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:24:17,284][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:24:17,942][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:24:18,599][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:24:19,257][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:24:19,914][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:24:20,571][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:24:21,229][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:24:22,221][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:24:22,882][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:24:23,542][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:24:24,200][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:24:24,858][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:24:25,517][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:24:26,174][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:24:26,833][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:24:27,491][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:24:28,148][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:24:28,806][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:24:29,465][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:24:30,123][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:24:30,782][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:24:31,441][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:24:32,101][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:24:32,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:24:33,685][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:24:35,114][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:24:35,116][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:24:35,118][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:24:36,521][__main__][INFO] - Iteration 213 took 53s (11.26% Gen, 86.12% Train). Generation: 6s, Training: 46s. Estimated remaining time: 11h 36m 12s. Estimated total time: 14h 54m 43s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 28s, 500 more iterations: 7h 27m 21s. [2026-03-25 17:24:36,523][__main__][INFO] - Starting iteration 213. [2026-03-25 17:24:36,527][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:24:36,528][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:24:41,322][__main__][INFO] - Number of regex retries in iteration 213: 0 [2026-03-25 17:24:41,322][__main__][INFO] - agents played in iteration 213 are Bob, Alice [2026-03-25 17:24:41,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:24:41,869][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:24:41,870][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:24:41,870][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:24:42,665][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:24:43,281][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:24:43,939][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:24:44,600][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:24:45,259][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:24:45,917][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:24:46,575][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:24:47,234][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:24:47,892][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:24:48,549][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:24:49,206][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:24:49,863][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:24:50,523][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:24:51,180][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:24:51,837][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:24:52,495][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:24:53,153][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:24:53,810][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:24:54,467][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:24:55,125][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:24:55,782][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:24:56,439][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:24:57,097][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:24:57,754][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:24:58,412][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:24:59,070][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:24:59,728][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:25:00,385][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:25:01,044][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:25:01,701][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:25:02,359][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:25:03,017][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:25:03,675][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:25:04,333][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:25:04,990][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:25:05,647][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:25:06,305][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:25:06,962][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:25:07,621][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:25:08,279][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:25:08,936][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:25:09,594][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:25:10,251][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:25:10,909][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:25:11,566][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:25:12,224][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:25:12,881][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:25:13,539][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:25:14,523][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:25:15,181][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:25:15,839][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:25:16,497][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:25:17,154][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:25:17,813][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:25:18,470][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:25:19,127][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:25:19,786][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:25:20,444][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:25:21,103][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:25:21,759][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:25:22,417][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:25:23,075][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:25:23,734][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:25:24,391][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:25:25,051][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:25:25,782][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:25:27,178][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:25:27,180][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:25:27,182][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:25:28,569][__main__][INFO] - Iteration 214 took 52s (9.21% Gen, 88.12% Train). Generation: 4s, Training: 45s. Estimated remaining time: 11h 8m 0s. Estimated total time: 14h 27m 23s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 44s, 500 more iterations: 7h 13m 41s. [2026-03-25 17:25:28,571][__main__][INFO] - Starting iteration 214. [2026-03-25 17:25:28,574][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:25:28,575][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:25:33,640][__main__][INFO] - Number of regex retries in iteration 214: 0 [2026-03-25 17:25:33,641][__main__][INFO] - agents played in iteration 214 are Bob, Alice [2026-03-25 17:25:34,121][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:25:34,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:25:34,183][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:25:34,183][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:25:34,989][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:25:35,606][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:25:36,267][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:25:36,923][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:25:37,581][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:25:38,239][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:25:38,896][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:25:39,554][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:25:40,212][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:25:40,868][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:25:41,527][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:25:42,185][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:25:42,843][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:25:43,502][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:25:44,158][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:25:44,818][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:25:45,476][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:25:46,133][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:25:46,791][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:25:47,449][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:25:48,107][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:25:48,764][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:25:49,424][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:25:50,082][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:25:50,740][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:25:51,398][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:25:52,056][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:25:52,714][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:25:53,372][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:25:54,030][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:25:54,688][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:25:55,345][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:25:56,004][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:25:56,661][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:25:57,318][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:25:57,975][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:25:58,634][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:25:59,291][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:25:59,948][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:26:00,605][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:26:01,264][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:26:01,921][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:26:02,578][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:26:03,235][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:26:03,892][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:26:04,549][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:26:05,206][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:26:05,864][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:26:06,856][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:26:07,514][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:26:08,171][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:26:08,831][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:26:09,488][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:26:10,146][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:26:10,806][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:26:11,463][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:26:12,120][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:26:12,777][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:26:13,434][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:26:14,091][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:26:14,748][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:26:15,406][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:26:16,063][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:26:16,720][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:26:17,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:26:18,238][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:26:19,759][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:26:19,762][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:26:19,763][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:26:21,312][__main__][INFO] - Iteration 215 took 52s (9.61% Gen, 87.45% Train). Generation: 5s, Training: 46s. Estimated remaining time: 11h 18m 42s. Estimated total time: 14h 38m 58s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 53s, 500 more iterations: 7h 19m 29s. [2026-03-25 17:26:21,314][__main__][INFO] - Starting iteration 215. [2026-03-25 17:26:21,319][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:26:21,319][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:26:27,503][__main__][INFO] - Number of regex retries in iteration 215: 0 [2026-03-25 17:26:27,504][__main__][INFO] - agents played in iteration 215 are Bob, Alice [2026-03-25 17:26:28,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:26:28,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:26:28,646][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:26:28,646][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:26:29,458][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:26:30,066][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:26:30,728][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:26:31,389][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:26:32,046][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:26:32,705][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:26:33,366][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:26:34,024][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:26:34,683][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:26:35,343][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:26:36,001][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:26:36,660][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:26:37,320][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:26:37,978][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:26:38,638][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:26:39,296][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:26:39,956][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:26:40,615][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:26:41,275][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:26:41,934][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:26:42,592][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:26:43,249][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:26:43,907][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:26:44,566][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:26:45,225][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:26:45,884][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:26:46,542][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:26:47,202][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:26:47,861][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:26:48,519][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:26:49,178][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:26:49,837][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:26:50,495][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:26:51,153][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:26:51,811][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:26:52,470][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:26:53,128][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:26:53,786][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:26:54,444][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:26:55,102][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:26:55,760][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:26:56,418][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:26:57,077][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:26:57,734][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:26:58,394][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:26:59,051][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:26:59,709][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:27:00,368][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:27:01,471][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:27:02,130][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:27:02,787][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:27:03,445][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:27:04,104][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:27:04,761][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:27:05,419][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:27:06,077][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:27:06,734][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:27:07,392][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:27:08,050][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:27:08,708][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:27:09,365][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:27:10,024][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:27:10,681][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:27:11,338][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:27:11,996][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:27:12,824][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:27:14,893][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:27:14,896][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:27:14,897][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:27:16,238][__main__][INFO] - Iteration 216 took 54s (11.26% Gen, 86.29% Train). Generation: 6s, Training: 47s. Estimated remaining time: 11h 54m 10s. Estimated total time: 15h 15m 21s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 32s, 500 more iterations: 7h 37m 40s. [2026-03-25 17:27:16,240][__main__][INFO] - Starting iteration 216. [2026-03-25 17:27:16,245][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:27:16,245][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:27:22,023][__main__][INFO] - Number of regex retries in iteration 216: 0 [2026-03-25 17:27:22,024][__main__][INFO] - agents played in iteration 216 are Bob, Alice [2026-03-25 17:27:22,843][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:27:22,904][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:27:22,905][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:27:22,905][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:27:23,654][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:27:24,276][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:27:24,933][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:27:25,592][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:27:26,251][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:27:26,908][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:27:27,566][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:27:28,224][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:27:28,883][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:27:29,540][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:27:30,199][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:27:30,856][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:27:31,515][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:27:32,173][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:27:32,830][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:27:33,490][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:27:34,153][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:27:34,807][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:27:35,468][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:27:36,124][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:27:36,782][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:27:37,439][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:27:38,096][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:27:38,753][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:27:39,411][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:27:40,070][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:27:40,727][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:27:41,385][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:27:42,043][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:27:42,701][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:27:43,359][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:27:44,016][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:27:44,673][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:27:45,331][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:27:45,988][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:27:46,646][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:27:47,304][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:27:47,961][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:27:48,619][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:27:49,277][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:27:49,935][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:27:50,592][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:27:51,249][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:27:51,907][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:27:52,564][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:27:53,222][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:27:53,880][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:27:54,537][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:27:55,543][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:27:56,201][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:27:56,859][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:27:57,517][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:27:58,175][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:27:58,835][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:27:59,493][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:28:00,154][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:28:00,809][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:28:01,467][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:28:02,125][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:28:02,782][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:28:03,440][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:28:04,097][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:28:04,754][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:28:05,412][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:28:06,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:28:06,869][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:28:08,408][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:28:08,411][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:28:08,412][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:28:09,810][__main__][INFO] - Iteration 217 took 53s (10.79% Gen, 86.60% Train). Generation: 5s, Training: 46s. Estimated remaining time: 11h 30m 42s. Estimated total time: 14h 52m 47s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 16s, 500 more iterations: 7h 26m 23s. [2026-03-25 17:28:09,812][__main__][INFO] - Starting iteration 217. [2026-03-25 17:28:09,818][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:28:09,818][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:28:14,930][__main__][INFO] - Number of regex retries in iteration 217: 0 [2026-03-25 17:28:14,930][__main__][INFO] - agents played in iteration 217 are Bob, Alice [2026-03-25 17:28:15,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:28:15,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:28:15,451][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:28:15,451][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:28:16,320][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:28:16,928][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:28:17,585][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:28:18,243][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:28:18,901][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:28:19,558][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:28:20,215][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:28:20,872][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:28:21,537][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:28:22,191][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:28:22,848][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:28:23,507][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:28:24,164][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:28:24,821][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:28:25,478][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:28:26,136][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:28:26,794][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:28:27,453][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:28:28,111][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:28:28,769][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:28:29,426][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:28:30,083][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:28:30,740][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:28:31,397][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:28:32,054][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:28:32,712][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:28:33,369][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:28:34,026][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:28:34,684][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:28:35,343][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:28:36,000][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:28:36,657][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:28:37,315][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:28:37,972][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:28:38,629][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:28:39,286][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:28:39,944][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:28:40,601][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:28:41,259][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:28:41,916][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:28:42,573][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:28:43,230][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:28:43,888][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:28:44,546][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:28:45,205][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:28:45,863][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:28:46,520][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:28:47,178][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:28:48,173][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:28:48,832][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:28:49,489][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:28:50,146][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:28:50,804][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:28:51,462][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:28:52,119][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:28:52,777][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:28:53,436][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:28:54,094][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:28:54,751][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:28:55,411][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:28:56,068][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:28:56,726][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:28:57,386][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:28:58,045][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:28:58,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:28:59,404][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:29:00,895][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:29:00,898][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:29:00,899][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:29:02,292][__main__][INFO] - Iteration 218 took 52s (9.74% Gen, 87.60% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 11m 39s. Estimated total time: 14h 34m 36s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 27s, 500 more iterations: 7h 17m 18s. [2026-03-25 17:29:02,295][__main__][INFO] - Starting iteration 218. [2026-03-25 17:29:02,298][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:29:02,299][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:29:08,003][__main__][INFO] - Number of regex retries in iteration 218: 0 [2026-03-25 17:29:08,003][__main__][INFO] - agents played in iteration 218 are Bob, Alice [2026-03-25 17:29:08,588][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:29:08,649][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:29:08,650][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:29:08,650][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:29:09,447][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:29:10,055][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:29:10,714][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:29:11,371][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:29:12,031][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:29:12,688][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:29:13,346][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:29:14,004][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:29:14,661][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:29:15,320][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:29:15,978][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:29:16,635][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:29:17,293][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:29:17,954][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:29:18,612][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:29:19,268][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:29:19,926][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:29:20,584][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:29:21,241][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:29:21,901][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:29:22,558][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:29:23,215][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:29:23,873][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:29:24,530][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:29:25,188][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:29:25,845][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:29:26,503][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:29:27,160][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:29:27,819][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:29:28,477][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:29:29,135][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:29:29,793][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:29:30,450][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:29:31,108][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:29:31,766][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:29:32,423][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:29:33,082][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:29:33,740][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:29:34,398][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:29:35,055][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:29:35,713][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:29:36,369][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:29:37,027][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:29:37,684][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:29:38,341][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:29:39,000][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:29:39,658][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:29:40,315][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:29:41,310][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:29:41,968][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:29:42,625][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:29:43,282][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:29:43,941][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:29:44,598][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:29:45,257][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:29:45,914][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:29:46,571][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:29:47,229][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:29:47,886][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:29:48,544][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:29:49,202][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:29:49,859][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:29:50,516][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:29:51,174][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:29:51,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:29:52,665][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:29:54,683][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:29:54,686][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:29:54,687][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:29:56,075][__main__][INFO] - Iteration 219 took 53s (10.61% Gen, 86.81% Train). Generation: 5s, Training: 46s. Estimated remaining time: 11h 32m 27s. Estimated total time: 14h 56m 18s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 37s, 500 more iterations: 7h 28m 9s. [2026-03-25 17:29:56,077][__main__][INFO] - Starting iteration 219. [2026-03-25 17:29:56,082][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:29:56,083][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:30:02,038][__main__][INFO] - Number of regex retries in iteration 219: 0 [2026-03-25 17:30:02,040][__main__][INFO] - agents played in iteration 219 are Bob, Alice [2026-03-25 17:30:02,516][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:30:02,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:30:02,579][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:30:02,579][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:30:03,297][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:30:03,905][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:30:04,563][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:30:05,221][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:30:05,882][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:30:06,537][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:30:07,194][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:30:07,850][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:30:08,507][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:30:09,166][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:30:09,823][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:30:10,480][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:30:11,137][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:30:11,796][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:30:12,453][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:30:13,110][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:30:13,767][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:30:14,423][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:30:15,080][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:30:15,738][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:30:16,396][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:30:17,055][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:30:17,712][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:30:18,369][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:30:19,026][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:30:19,683][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:30:20,340][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:30:20,999][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:30:21,657][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:30:22,314][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:30:22,971][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:30:23,629][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:30:24,285][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:30:24,943][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:30:25,600][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:30:26,257][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:30:26,914][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:30:27,570][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:30:28,227][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:30:28,886][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:30:29,543][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:30:30,201][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:30:30,859][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:30:31,517][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:30:32,175][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:30:32,832][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:30:33,489][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:30:34,146][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:30:35,135][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:30:35,793][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:30:36,451][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:30:37,111][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:30:37,769][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:30:38,426][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:30:39,086][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:30:39,744][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:30:40,403][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:30:41,061][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:30:41,718][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:30:42,375][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:30:43,034][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:30:43,691][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:30:44,350][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:30:45,007][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:30:45,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:30:46,384][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:30:47,757][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:30:47,760][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:30:47,761][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:30:49,190][__main__][INFO] - Iteration 220 took 53s (11.22% Gen, 86.09% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 20m 26s. Estimated total time: 14h 45m 10s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 31s, 500 more iterations: 7h 22m 35s. [2026-03-25 17:30:49,192][__main__][INFO] - Starting iteration 220. [2026-03-25 17:30:49,196][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:30:49,196][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:30:54,245][__main__][INFO] - Number of regex retries in iteration 220: 0 [2026-03-25 17:30:54,246][__main__][INFO] - agents played in iteration 220 are Bob, Alice [2026-03-25 17:30:54,709][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:30:54,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:30:54,771][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:30:54,771][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:30:55,561][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:30:56,189][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:30:56,848][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:30:57,505][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:30:58,163][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:30:58,820][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:30:59,478][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:31:00,135][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:31:00,793][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:31:01,450][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:31:02,107][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:31:02,765][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:31:03,422][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:31:04,080][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:31:04,737][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:31:05,394][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:31:06,052][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:31:06,710][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:31:07,370][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:31:08,027][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:31:08,684][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:31:09,341][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:31:09,998][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:31:10,655][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:31:11,313][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:31:11,971][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:31:12,628][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:31:13,285][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:31:13,944][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:31:14,601][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:31:15,258][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:31:15,915][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:31:16,572][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:31:17,229][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:31:17,886][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:31:18,549][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:31:19,205][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:31:19,862][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:31:20,520][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:31:21,178][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:31:21,836][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:31:22,494][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:31:23,151][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:31:23,808][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:31:24,466][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:31:25,123][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:31:25,781][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:31:26,438][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:31:27,433][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:31:28,093][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:31:28,755][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:31:29,411][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:31:30,068][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:31:30,727][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:31:31,385][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:31:32,042][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:31:32,700][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:31:33,359][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:31:34,017][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:31:34,676][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:31:35,334][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:31:35,991][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:31:36,648][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:31:37,305][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:31:37,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:31:38,758][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:31:40,151][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:31:40,154][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:31:40,155][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:31:41,633][__main__][INFO] - Iteration 221 took 52s (9.63% Gen, 87.55% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 8m 22s. Estimated total time: 14h 33m 59s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 23s, 500 more iterations: 7h 16m 59s. [2026-03-25 17:31:41,635][__main__][INFO] - Starting iteration 221. [2026-03-25 17:31:41,640][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:31:41,640][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:31:47,387][__main__][INFO] - Number of regex retries in iteration 221: 0 [2026-03-25 17:31:47,388][__main__][INFO] - agents played in iteration 221 are Bob, Alice [2026-03-25 17:31:48,492][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:31:48,553][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:31:48,554][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:31:48,555][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:31:49,385][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:31:49,995][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:31:50,654][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:31:51,311][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:31:51,969][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:31:52,626][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:31:53,285][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:31:53,942][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:31:54,599][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:31:55,256][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:31:55,918][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:31:56,576][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:31:57,234][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:31:57,892][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:31:58,549][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:31:59,206][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:31:59,863][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:32:00,519][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:32:01,177][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:32:01,834][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:32:02,492][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:32:03,149][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:32:03,807][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:32:04,464][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:32:05,122][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:32:05,778][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:32:06,435][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:32:07,093][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:32:07,751][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:32:08,406][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:32:09,063][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:32:09,721][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:32:10,379][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:32:11,037][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:32:11,694][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:32:12,354][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:32:13,012][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:32:13,669][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:32:14,326][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:32:14,983][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:32:15,640][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:32:16,298][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:32:16,956][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:32:17,613][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:32:18,271][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:32:18,929][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:32:19,586][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:32:20,243][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:32:21,237][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:32:21,896][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:32:22,554][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:32:23,211][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:32:23,869][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:32:24,526][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:32:25,183][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:32:25,842][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:32:26,499][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:32:27,156][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:32:27,813][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:32:28,470][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:32:29,129][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:32:29,787][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:32:30,445][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:32:31,102][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:32:31,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:32:32,661][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:32:34,073][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:32:34,076][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:32:34,077][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:32:35,709][__main__][INFO] - Iteration 222 took 54s (10.63% Gen, 86.35% Train). Generation: 5s, Training: 46s. Estimated remaining time: 11h 34m 41s. Estimated total time: 15h 1m 11s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 7s, 500 more iterations: 7h 30m 35s. [2026-03-25 17:32:35,712][__main__][INFO] - Starting iteration 222. [2026-03-25 17:32:35,716][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:32:35,717][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:32:45,735][__main__][INFO] - Number of regex retries in iteration 222: 0 [2026-03-25 17:32:45,736][__main__][INFO] - agents played in iteration 222 are Bob, Alice [2026-03-25 17:32:46,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:32:46,861][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:32:46,862][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:32:46,862][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:32:47,521][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:32:48,130][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:32:48,788][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:32:49,445][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:32:50,102][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:32:50,759][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:32:51,416][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:32:52,073][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:32:52,732][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:32:53,390][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:32:54,047][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:32:54,705][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:32:55,362][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:32:56,019][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:32:56,677][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:32:57,334][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:32:57,991][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:32:58,649][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:32:59,307][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:32:59,965][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:33:00,623][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:33:01,281][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:33:01,938][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:33:02,595][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:33:03,252][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:33:03,910][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:33:04,568][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:33:05,225][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:33:05,882][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:33:06,539][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:33:07,196][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:33:07,853][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:33:08,511][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:33:09,169][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:33:09,827][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:33:10,484][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:33:11,143][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:33:11,800][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:33:12,457][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:33:13,115][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:33:13,772][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:33:14,429][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:33:15,085][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:33:15,742][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:33:16,399][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:33:17,056][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:33:17,715][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:33:18,372][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:33:19,353][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:33:20,012][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:33:20,669][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:33:21,327][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:33:21,984][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:33:22,642][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:33:23,300][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:33:23,957][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:33:24,615][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:33:25,272][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:33:25,930][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:33:26,587][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:33:27,245][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:33:27,902][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:33:28,560][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:33:29,217][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:33:29,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:33:30,725][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:33:32,091][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:33:32,094][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:33:32,095][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:33:33,550][__main__][INFO] - Iteration 223 took 57s (17.32% Gen, 80.16% Train). Generation: 10s, Training: 46s. Estimated remaining time: 12h 36m 26s. Estimated total time: 16h 3m 54s. Time estimates for 10 more iterations: 9m 38s, 100 more iterations: 1h 36m 23s, 500 more iterations: 8h 1m 57s. [2026-03-25 17:33:33,552][__main__][INFO] - Starting iteration 223. [2026-03-25 17:33:33,556][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:33:33,557][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:33:38,999][__main__][INFO] - Number of regex retries in iteration 223: 0 [2026-03-25 17:33:39,001][__main__][INFO] - agents played in iteration 223 are Bob, Alice [2026-03-25 17:33:39,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:33:39,649][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:33:39,649][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:33:39,650][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:33:40,369][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:33:40,998][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:33:41,656][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:33:42,314][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:33:42,971][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:33:43,629][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:33:44,286][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:33:44,944][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:33:45,601][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:33:46,258][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:33:46,915][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:33:47,572][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:33:48,230][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:33:48,887][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:33:49,544][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:33:50,201][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:33:50,858][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:33:51,516][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:33:52,173][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:33:52,830][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:33:53,487][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:33:54,145][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:33:54,802][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:33:55,459][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:33:56,116][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:33:56,773][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:33:57,431][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:33:58,088][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:33:58,746][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:33:59,404][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:34:00,061][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:34:00,719][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:34:01,376][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:34:02,034][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:34:02,691][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:34:03,348][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:34:04,006][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:34:04,662][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:34:05,320][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:34:05,977][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:34:06,635][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:34:07,292][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:34:07,950][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:34:08,607][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:34:09,264][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:34:09,922][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:34:10,579][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:34:11,237][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:34:12,223][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:34:12,884][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:34:13,542][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:34:14,200][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:34:14,859][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:34:15,517][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:34:16,177][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:34:16,834][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:34:17,492][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:34:18,149][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:34:18,808][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:34:19,465][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:34:20,122][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:34:20,780][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:34:21,438][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:34:22,095][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:34:22,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:34:23,527][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:34:25,132][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:34:25,135][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:34:25,136][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:34:26,616][__main__][INFO] - Iteration 224 took 53s (10.26% Gen, 86.95% Train). Generation: 5s, Training: 46s. Estimated remaining time: 11h 16m 0s. Estimated total time: 14h 44m 21s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 26s, 500 more iterations: 7h 22m 10s. [2026-03-25 17:34:26,619][__main__][INFO] - Starting iteration 224. [2026-03-25 17:34:26,622][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:34:26,623][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:34:31,783][__main__][INFO] - Number of regex retries in iteration 224: 0 [2026-03-25 17:34:31,784][__main__][INFO] - agents played in iteration 224 are Bob, Alice [2026-03-25 17:34:32,259][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:34:32,320][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:34:32,320][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:34:32,321][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:34:33,114][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:34:33,768][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:34:34,422][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:34:35,085][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:34:35,743][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:34:36,400][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:34:37,058][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:34:37,715][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:34:38,373][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:34:39,030][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:34:39,689][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:34:40,346][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:34:41,003][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:34:41,662][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:34:42,319][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:34:42,977][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:34:43,639][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:34:44,296][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:34:44,953][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:34:45,610][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:34:46,267][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:34:46,924][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:34:47,582][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:34:48,239][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:34:48,897][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:34:49,554][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:34:50,213][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:34:50,870][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:34:51,527][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:34:52,184][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:34:52,841][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:34:53,499][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:34:54,156][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:34:54,814][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:34:55,472][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:34:56,129][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:34:56,788][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:34:57,446][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:34:58,103][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:34:58,761][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:34:59,418][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:35:00,075][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:35:00,732][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:35:01,390][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:35:02,047][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:35:02,704][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:35:03,361][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:35:04,018][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:35:05,005][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:35:05,663][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:35:06,321][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:35:06,978][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:35:07,635][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:35:08,294][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:35:08,950][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:35:09,607][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:35:10,266][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:35:10,923][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:35:11,582][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:35:12,239][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:35:12,898][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:35:13,554][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:35:14,213][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:35:14,872][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:35:15,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:35:16,246][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:35:17,727][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:35:17,730][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:35:17,733][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:35:19,237][__main__][INFO] - Iteration 225 took 52s (9.81% Gen, 87.33% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 7m 42s. Estimated total time: 14h 36m 56s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 41s, 500 more iterations: 7h 18m 28s. [2026-03-25 17:35:19,240][__main__][INFO] - Starting iteration 225. [2026-03-25 17:35:19,244][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:35:19,244][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:35:24,473][__main__][INFO] - Number of regex retries in iteration 225: 0 [2026-03-25 17:35:24,474][__main__][INFO] - agents played in iteration 225 are Bob, Alice [2026-03-25 17:35:25,086][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:35:25,147][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:35:25,147][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:35:25,148][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:35:26,009][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:35:26,645][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:35:27,304][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:35:27,963][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:35:28,622][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:35:29,277][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:35:29,935][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:35:30,593][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:35:31,251][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:35:31,908][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:35:32,565][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:35:33,223][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:35:33,880][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:35:34,537][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:35:35,194][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:35:35,853][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:35:36,510][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:35:37,167][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:35:37,826][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:35:38,484][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:35:39,142][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:35:39,800][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:35:40,457][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:35:41,116][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:35:41,774][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:35:42,431][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:35:43,088][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:35:43,745][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:35:44,402][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:35:45,059][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:35:45,717][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:35:46,375][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:35:47,032][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:35:47,689][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:35:48,352][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:35:49,009][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:35:49,667][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:35:50,324][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:35:50,982][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:35:51,639][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:35:52,296][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:35:52,954][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:35:53,611][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:35:54,269][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:35:54,926][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:35:55,584][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:35:56,242][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:35:56,899][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:35:57,886][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:35:58,546][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:35:59,205][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:35:59,863][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:36:00,520][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:36:01,177][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:36:01,835][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:36:02,492][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:36:03,149][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:36:03,806][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:36:04,466][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:36:05,122][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:36:05,780][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:36:06,437][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:36:07,094][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:36:07,752][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:36:08,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:36:09,183][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:36:10,574][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:36:10,576][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:36:10,578][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:36:11,988][__main__][INFO] - Iteration 226 took 52s (9.91% Gen, 87.41% Train). Generation: 5s, Training: 46s. Estimated remaining time: 11h 8m 59s. Estimated total time: 14h 39m 6s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 54s, 500 more iterations: 7h 19m 33s. [2026-03-25 17:36:11,991][__main__][INFO] - Starting iteration 226. [2026-03-25 17:36:11,995][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:36:11,996][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:36:18,442][__main__][INFO] - Number of regex retries in iteration 226: 0 [2026-03-25 17:36:18,443][__main__][INFO] - agents played in iteration 226 are Bob, Alice [2026-03-25 17:36:18,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:36:18,971][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:36:18,971][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:36:18,972][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:36:19,788][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:36:20,397][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:36:21,056][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:36:21,713][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:36:22,372][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:36:23,031][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:36:23,690][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:36:24,348][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:36:25,006][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:36:25,664][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:36:26,322][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:36:26,980][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:36:27,641][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:36:28,301][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:36:28,962][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:36:29,620][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:36:30,278][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:36:30,936][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:36:31,594][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:36:32,252][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:36:32,909][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:36:33,567][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:36:34,225][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:36:34,882][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:36:35,539][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:36:36,197][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:36:36,855][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:36:37,512][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:36:38,169][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:36:38,826][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:36:39,484][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:36:40,141][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:36:40,799][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:36:41,457][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:36:42,114][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:36:42,773][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:36:43,431][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:36:44,088][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:36:44,745][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:36:45,403][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:36:46,060][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:36:46,717][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:36:47,374][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:36:48,031][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:36:48,689][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:36:49,346][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:36:50,004][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:36:50,666][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:36:51,675][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:36:52,333][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:36:52,990][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:36:53,648][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:36:54,305][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:36:54,964][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:36:55,622][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:36:56,280][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:36:56,938][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:36:57,595][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:36:58,252][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:36:58,910][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:36:59,568][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:37:00,226][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:37:00,884][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:37:01,542][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:37:02,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:37:02,991][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:37:04,553][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:37:04,556][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:37:04,557][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:37:06,003][__main__][INFO] - Iteration 227 took 54s (11.94% Gen, 85.38% Train). Generation: 6s, Training: 46s. Estimated remaining time: 11h 29m 8s. Estimated total time: 15h 0m 9s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 0s, 500 more iterations: 7h 30m 4s. [2026-03-25 17:37:06,005][__main__][INFO] - Starting iteration 227. [2026-03-25 17:37:06,009][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:37:06,010][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:37:12,304][__main__][INFO] - Number of regex retries in iteration 227: 0 [2026-03-25 17:37:12,306][__main__][INFO] - agents played in iteration 227 are Bob, Alice [2026-03-25 17:37:13,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:37:13,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:37:13,441][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:37:13,442][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:37:14,135][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:37:14,749][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:37:15,406][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:37:16,064][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:37:16,721][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:37:17,379][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:37:18,039][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:37:18,695][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:37:19,353][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:37:20,011][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:37:20,669][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:37:21,326][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:37:21,984][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:37:22,641][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:37:23,298][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:37:23,955][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:37:24,612][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:37:25,270][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:37:25,927][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:37:26,584][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:37:27,241][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:37:27,898][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:37:28,557][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:37:29,214][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:37:29,871][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:37:30,528][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:37:31,186][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:37:31,843][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:37:32,502][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:37:33,159][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:37:33,816][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:37:34,474][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:37:35,131][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:37:35,788][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:37:36,444][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:37:37,101][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:37:37,758][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:37:38,417][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:37:39,074][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:37:39,731][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:37:40,389][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:37:41,046][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:37:41,703][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:37:42,362][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:37:43,020][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:37:43,677][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:37:44,334][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:37:44,992][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:37:45,984][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:37:46,641][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:37:47,299][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:37:47,956][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:37:48,614][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:37:49,273][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:37:49,930][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:37:50,588][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:37:51,245][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:37:51,903][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:37:52,560][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:37:53,219][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:37:53,877][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:37:54,534][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:37:55,193][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:37:55,851][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:37:56,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:37:57,290][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:37:58,909][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:37:58,912][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:37:58,922][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:38:00,382][__main__][INFO] - Iteration 228 took 54s (11.58% Gen, 85.73% Train). Generation: 6s, Training: 46s. Estimated remaining time: 11h 34m 19s. Estimated total time: 15h 6m 14s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 37s, 500 more iterations: 7h 33m 7s. [2026-03-25 17:38:00,384][__main__][INFO] - Starting iteration 228. [2026-03-25 17:38:00,390][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:38:00,390][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:38:24,763][__main__][INFO] - Number of regex retries in iteration 228: 0 [2026-03-25 17:38:24,764][__main__][INFO] - agents played in iteration 228 are Bob, Alice [2026-03-25 17:38:25,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:38:25,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:38:25,431][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:38:25,431][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:38:26,273][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:38:26,901][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:38:27,559][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:38:28,218][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:38:28,876][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:38:29,533][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:38:30,190][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:38:30,848][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:38:31,504][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:38:32,162][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:38:32,818][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:38:33,475][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:38:34,131][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:38:34,789][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:38:35,445][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:38:36,103][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:38:36,760][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:38:37,417][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:38:38,077][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:38:38,734][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:38:39,391][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:38:40,047][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:38:40,704][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:38:41,361][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:38:42,019][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:38:42,676][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:38:43,334][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:38:43,991][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:38:44,648][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:38:45,305][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:38:45,963][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:38:46,621][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:38:47,278][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:38:47,935][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:38:48,592][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:38:49,249][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:38:49,906][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:38:50,563][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:38:51,220][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:38:51,877][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:38:52,536][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:38:53,193][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:38:53,850][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:38:54,508][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:38:55,167][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:38:55,825][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:38:56,483][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:38:57,140][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:38:58,135][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:38:58,796][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:38:59,454][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:39:00,112][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:39:00,770][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:39:01,427][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:39:02,085][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:39:02,743][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:39:03,400][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:39:04,058][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:39:04,716][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:39:05,374][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:39:06,038][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:39:06,697][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:39:07,355][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:39:08,013][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:39:08,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:39:09,500][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:39:10,952][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:39:10,955][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:39:10,956][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:39:12,321][__main__][INFO] - Iteration 229 took 1m 11s (33.88% Gen, 64.21% Train). Generation: 24s, Training: 46s. Estimated remaining time: 16h 25m 47s. Estimated total time: 19h 58m 54s. Time estimates for 10 more iterations: 11m 59s, 100 more iterations: 1h 59m 53s, 500 more iterations: 9h 59m 27s. [2026-03-25 17:39:12,323][__main__][INFO] - Starting iteration 229. [2026-03-25 17:39:12,331][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:39:12,331][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:39:20,569][__main__][INFO] - Number of regex retries in iteration 229: 0 [2026-03-25 17:39:20,571][__main__][INFO] - agents played in iteration 229 are Bob, Alice [2026-03-25 17:39:21,159][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:39:21,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:39:21,220][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:39:21,221][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:39:22,011][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:39:22,623][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:39:23,283][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:39:23,938][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:39:24,594][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:39:25,251][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:39:25,913][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:39:26,576][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:39:27,226][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:39:27,885][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:39:28,544][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:39:29,202][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:39:29,861][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:39:30,518][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:39:31,176][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:39:31,834][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:39:32,491][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:39:33,148][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:39:33,807][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:39:34,465][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:39:35,123][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:39:35,780][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:39:36,438][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:39:37,096][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:39:37,753][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:39:38,410][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:39:39,068][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:39:39,727][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:39:40,385][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:39:41,044][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:39:41,702][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:39:42,359][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:39:43,016][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:39:43,674][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:39:44,331][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:39:44,990][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:39:45,648][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:39:46,305][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:39:46,962][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:39:47,619][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:39:48,276][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:39:48,933][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:39:49,590][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:39:53,642][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:39:54,299][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:39:54,958][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:39:55,616][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:39:56,276][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:39:57,268][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:39:57,928][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:39:58,587][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:39:59,247][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:39:59,904][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:40:00,561][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:40:01,218][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:40:01,875][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:40:02,532][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:40:03,189][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:40:03,846][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:40:04,503][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:40:05,160][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:40:05,818][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:40:06,475][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:40:07,132][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:40:07,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:40:08,675][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:46 [2026-03-25 17:40:10,027][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:40:10,030][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:40:10,031][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:40:11,530][__main__][INFO] - Iteration 230 took 59s (13.92% Gen, 83.54% Train). Generation: 8s, Training: 49s. Estimated remaining time: 12h 52m 36s. Estimated total time: 16h 26m 42s. Time estimates for 10 more iterations: 9m 52s, 100 more iterations: 1h 38m 40s, 500 more iterations: 8h 13m 21s. [2026-03-25 17:40:11,532][__main__][INFO] - Starting iteration 230. [2026-03-25 17:40:11,536][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:40:11,537][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:40:16,848][__main__][INFO] - Number of regex retries in iteration 230: 0 [2026-03-25 17:40:16,850][__main__][INFO] - agents played in iteration 230 are Bob, Alice [2026-03-25 17:40:17,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:40:17,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:40:17,400][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:40:17,401][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:40:18,218][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:40:18,827][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:40:19,491][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:40:20,145][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:40:20,804][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:40:21,461][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:40:22,120][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:40:22,778][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:40:23,435][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:40:24,093][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:40:24,751][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:40:25,409][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:40:26,067][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:40:26,725][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:40:27,384][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:40:28,042][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:40:28,699][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:40:29,356][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:40:30,014][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:40:30,671][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:40:31,328][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:40:31,986][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:40:32,643][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:40:33,302][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:40:33,959][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:40:34,617][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:40:35,273][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:40:35,931][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:40:36,588][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:40:37,245][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:40:37,903][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:40:38,564][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:40:39,224][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:40:39,880][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:40:40,539][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:40:41,199][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:40:41,856][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:40:42,513][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:40:43,170][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:40:43,827][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:40:44,484][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:40:45,141][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:40:45,798][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:40:46,454][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:40:47,112][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:40:47,768][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:40:48,428][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:40:49,086][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:40:50,082][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:40:50,742][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:40:51,399][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:40:52,056][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:40:52,714][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:40:53,371][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:40:54,029][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:40:54,687][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:40:55,345][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:40:56,002][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:40:56,660][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:40:57,319][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:40:57,977][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:40:58,635][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:40:59,293][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:40:59,950][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:41:00,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:41:01,401][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:41:02,971][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:41:02,974][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:41:02,975][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:41:04,488][__main__][INFO] - Iteration 231 took 52s (10.03% Gen, 87.10% Train). Generation: 5s, Training: 46s. Estimated remaining time: 11h 7m 35s. Estimated total time: 14h 42m 34s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 15s, 500 more iterations: 7h 21m 17s. [2026-03-25 17:41:04,491][__main__][INFO] - Starting iteration 231. [2026-03-25 17:41:04,496][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:41:04,497][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:41:10,737][__main__][INFO] - Number of regex retries in iteration 231: 0 [2026-03-25 17:41:10,739][__main__][INFO] - agents played in iteration 231 are Bob, Alice [2026-03-25 17:41:11,567][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:41:11,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:41:11,629][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:41:11,629][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:41:12,338][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:41:12,947][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:41:13,603][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:41:14,260][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:41:14,917][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:41:15,574][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:41:16,230][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:41:16,889][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:41:17,545][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:41:18,205][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:41:18,862][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:41:19,520][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:41:20,177][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:41:20,835][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:41:21,492][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:41:22,149][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:41:22,806][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:41:23,463][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:41:24,120][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:41:24,777][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:41:25,435][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:41:26,093][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:41:26,750][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:41:27,407][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:41:28,064][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:41:28,722][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:41:29,379][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:41:30,037][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:41:30,694][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:41:31,351][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:41:32,008][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:41:32,664][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:41:33,321][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:41:33,978][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:41:34,636][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:41:35,294][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:41:35,951][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:41:36,608][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:41:37,265][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:41:37,923][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:41:38,580][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:41:39,238][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:41:39,895][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:41:40,552][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:41:41,210][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:41:41,867][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:41:42,524][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:41:43,182][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:41:44,164][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:41:44,823][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:41:45,481][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:41:46,138][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:41:46,796][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:41:47,455][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:41:48,113][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:41:48,771][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:41:49,428][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:41:50,085][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:41:50,743][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:41:51,400][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:41:52,057][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:41:52,715][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:41:53,374][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:41:54,032][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:41:54,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:41:55,471][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:41:56,835][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:41:56,837][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:41:56,839][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:41:58,245][__main__][INFO] - Iteration 232 took 53s (11.61% Gen, 85.76% Train). Generation: 6s, Training: 46s. Estimated remaining time: 11h 19m 58s. Estimated total time: 14h 55m 51s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 35s, 500 more iterations: 7h 27m 55s. [2026-03-25 17:41:58,248][__main__][INFO] - Starting iteration 232. [2026-03-25 17:41:58,253][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:41:58,254][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:42:07,364][__main__][INFO] - Number of regex retries in iteration 232: 0 [2026-03-25 17:42:07,365][__main__][INFO] - agents played in iteration 232 are Bob, Alice [2026-03-25 17:42:08,431][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:42:08,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:42:08,493][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:42:08,494][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:42:09,315][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:42:09,931][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:42:10,590][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:42:11,247][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:42:11,903][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:42:12,561][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:42:13,218][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:42:13,875][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:42:14,533][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:42:15,189][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:42:15,846][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:42:16,504][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:42:17,160][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:42:17,820][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:42:18,478][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:42:19,136][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:42:19,794][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:42:20,451][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:42:21,109][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:42:21,767][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:42:22,424][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:42:23,081][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:42:23,740][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:42:24,397][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:42:25,054][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:42:25,711][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:42:26,370][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:42:27,027][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:42:27,684][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:42:28,342][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:42:28,999][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:42:29,656][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:42:30,313][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:42:30,972][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:42:31,630][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:42:32,286][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:42:32,943][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:42:33,600][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:42:34,257][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:42:34,915][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:42:35,572][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:42:36,230][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:42:36,887][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:42:37,544][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:42:38,201][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:42:38,858][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:42:39,515][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:42:40,172][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:42:41,164][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:42:41,823][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:42:42,481][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:42:43,138][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:42:43,795][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:42:44,453][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:42:45,110][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:42:45,768][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:42:46,427][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:42:47,084][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:42:47,742][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:42:48,400][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:42:49,059][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:42:49,716][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:42:50,373][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:42:51,031][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:42:51,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:42:52,610][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:42:54,206][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:42:54,209][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:42:54,210][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:42:55,707][__main__][INFO] - Iteration 233 took 57s (15.86% Gen, 81.53% Train). Generation: 9s, Training: 46s. Estimated remaining time: 12h 20m 46s. Estimated total time: 15h 57m 36s. Time estimates for 10 more iterations: 9m 34s, 100 more iterations: 1h 35m 45s, 500 more iterations: 7h 58m 48s. [2026-03-25 17:42:55,710][__main__][INFO] - Starting iteration 233. [2026-03-25 17:42:55,713][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:42:55,714][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:43:05,896][__main__][INFO] - Number of regex retries in iteration 233: 0 [2026-03-25 17:43:05,898][__main__][INFO] - agents played in iteration 233 are Bob, Alice [2026-03-25 17:43:06,498][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:43:06,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:43:06,560][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:43:06,560][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:43:07,393][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:43:07,998][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:43:08,657][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:43:09,315][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:43:09,973][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:43:10,631][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:43:11,288][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:43:11,945][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:43:12,602][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:43:13,259][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:43:13,916][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:43:14,573][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:43:15,230][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:43:15,888][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:43:16,545][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:43:17,202][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:43:17,859][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:43:18,518][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:43:19,175][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:43:19,832][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:43:20,489][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:43:21,148][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:43:21,805][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:43:22,461][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:43:23,118][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:43:23,775][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:43:24,432][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:43:25,089][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:43:25,746][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:43:26,403][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:43:27,061][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:43:27,718][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:43:28,377][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:43:29,035][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:43:29,693][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:43:30,350][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:43:31,007][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:43:31,665][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:43:32,322][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:43:32,980][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:43:33,637][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:43:34,293][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:43:34,950][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:43:35,607][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:43:36,265][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:43:36,923][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:43:37,580][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:43:38,238][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:43:39,227][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:43:39,886][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:43:40,547][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:43:41,204][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:43:41,861][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:43:42,518][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:43:43,175][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:43:43,832][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:43:44,489][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:43:45,146][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:43:45,805][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:43:46,463][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:43:47,123][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:43:47,777][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:43:48,434][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:43:49,091][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:43:49,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:43:50,509][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:43:51,895][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:43:51,899][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:43:51,900][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:43:53,379][__main__][INFO] - Iteration 234 took 57s (17.66% Gen, 79.77% Train). Generation: 10s, Training: 46s. Estimated remaining time: 12h 23m 19s. Estimated total time: 16h 1m 7s. Time estimates for 10 more iterations: 9m 36s, 100 more iterations: 1h 36m 6s, 500 more iterations: 8h 0m 33s. [2026-03-25 17:43:53,382][__main__][INFO] - Starting iteration 234. [2026-03-25 17:43:53,386][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:43:53,387][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:43:59,178][__main__][INFO] - Number of regex retries in iteration 234: 0 [2026-03-25 17:43:59,179][__main__][INFO] - agents played in iteration 234 are Bob, Alice [2026-03-25 17:43:59,644][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:43:59,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:43:59,707][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:43:59,707][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:44:00,451][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:44:01,086][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:44:01,747][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:44:02,405][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:44:03,063][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:44:03,721][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:44:04,380][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:44:05,038][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:44:05,697][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:44:06,354][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:44:07,013][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:44:07,672][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:44:08,330][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:44:08,988][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:44:09,646][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:44:10,306][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:44:10,964][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:44:11,622][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:44:12,279][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:44:12,937][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:44:13,596][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:44:14,255][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:44:14,912][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:44:15,571][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:44:16,230][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:44:16,888][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:44:17,546][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:44:18,205][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:44:18,863][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:44:19,521][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:44:20,180][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:44:20,838][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:44:21,497][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:44:22,155][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:44:22,814][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:44:23,472][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:44:24,130][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:44:24,788][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:44:25,446][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:44:26,105][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:44:26,762][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:44:27,420][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:44:28,078][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:44:28,737][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:44:29,396][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:44:30,055][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:44:30,713][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:44:31,371][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:44:32,359][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:44:33,017][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:44:33,674][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:44:34,331][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:44:34,988][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:44:35,646][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:44:36,303][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:44:36,961][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:44:37,618][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:44:38,275][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:44:38,932][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:44:39,589][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:44:40,247][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:44:40,905][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:44:41,562][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:44:42,219][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:44:42,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:44:43,658][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:44:45,050][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:44:45,053][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:44:45,054][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:44:46,420][__main__][INFO] - Iteration 235 took 53s (10.92% Gen, 86.50% Train). Generation: 5s, Training: 45s. Estimated remaining time: 11h 5m 15s. Estimated total time: 14h 43m 56s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 23s, 500 more iterations: 7h 21m 58s. [2026-03-25 17:44:46,423][__main__][INFO] - Starting iteration 235. [2026-03-25 17:44:46,426][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:44:46,427][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:44:54,353][__main__][INFO] - Number of regex retries in iteration 235: 0 [2026-03-25 17:44:54,354][__main__][INFO] - agents played in iteration 235 are Bob, Alice [2026-03-25 17:44:54,938][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:44:54,999][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:44:55,000][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:44:55,000][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:44:55,766][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:44:56,390][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:44:57,046][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:44:57,702][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:44:58,359][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:44:59,015][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:44:59,674][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:45:00,331][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:45:00,988][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:45:01,645][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:45:02,302][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:45:02,959][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:45:03,616][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:45:04,275][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:45:04,933][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:45:05,591][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:45:06,249][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:45:06,906][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:45:07,563][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:45:08,220][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:45:08,883][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:45:09,539][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:45:10,196][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:45:10,853][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:45:11,511][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:45:12,169][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:45:12,827][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:45:13,485][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:45:14,142][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:45:14,804][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:45:15,461][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:45:16,120][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:45:16,778][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:45:17,436][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:45:18,094][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:45:18,751][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:45:19,409][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:45:20,068][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:45:20,726][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:45:21,384][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:45:22,041][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:45:22,700][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:45:23,357][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:45:24,014][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:45:24,672][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:45:25,330][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:45:25,990][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:45:26,647][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:45:27,630][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:45:28,289][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:45:28,947][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:45:29,605][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:45:30,262][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:45:30,920][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:45:31,578][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:45:32,235][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:45:32,893][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:45:33,550][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:45:34,207][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:45:34,865][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:45:35,522][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:45:36,179][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:45:36,838][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:45:37,496][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:45:38,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:45:38,982][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:45:40,363][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:45:40,366][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:45:40,368][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:45:41,790][__main__][INFO] - Iteration 236 took 55s (14.32% Gen, 83.11% Train). Generation: 7s, Training: 46s. Estimated remaining time: 11h 43m 9s. Estimated total time: 15h 22m 46s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 16s, 500 more iterations: 7h 41m 23s. [2026-03-25 17:45:41,793][__main__][INFO] - Starting iteration 236. [2026-03-25 17:45:41,799][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:45:41,799][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:45:49,010][__main__][INFO] - Number of regex retries in iteration 236: 0 [2026-03-25 17:45:49,011][__main__][INFO] - agents played in iteration 236 are Bob, Alice [2026-03-25 17:45:50,055][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:45:50,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:45:50,117][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:45:50,118][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:45:50,889][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:45:51,506][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:45:52,165][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:45:52,822][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:45:53,478][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:45:54,136][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:45:54,793][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:45:55,450][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:45:56,106][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:45:56,762][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:45:57,418][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:45:58,075][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:45:58,731][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:45:59,387][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:46:00,044][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:46:00,700][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:46:01,357][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:46:02,014][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:46:02,670][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:46:03,327][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:46:03,984][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:46:04,641][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:46:05,298][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:46:05,957][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:46:06,614][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:46:07,273][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:46:07,931][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:46:08,589][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:46:09,246][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:46:09,905][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:46:10,563][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:46:11,220][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:46:11,877][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:46:12,534][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:46:13,194][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:46:13,851][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:46:14,508][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:46:15,166][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:46:15,824][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:46:16,481][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:46:17,138][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:46:17,797][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:46:18,455][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:46:19,112][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:46:19,769][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:46:20,425][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:46:21,083][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:46:21,740][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:46:22,724][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:46:23,382][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:46:24,039][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:46:24,696][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:46:25,353][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:46:26,011][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:46:26,669][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:46:27,326][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:46:27,983][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:46:28,641][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:46:29,298][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:46:29,956][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:46:30,614][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:46:31,272][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:46:31,930][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:46:32,588][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:46:33,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:46:34,032][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:46:35,414][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:46:35,417][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:46:35,418][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:46:37,245][__main__][INFO] - Iteration 237 took 55s (13.01% Gen, 83.69% Train). Generation: 7s, Training: 46s. Estimated remaining time: 11h 43m 36s. Estimated total time: 15h 24m 8s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 24s, 500 more iterations: 7h 42m 4s. [2026-03-25 17:46:37,248][__main__][INFO] - Starting iteration 237. [2026-03-25 17:46:37,253][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:46:37,254][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:46:43,289][__main__][INFO] - Number of regex retries in iteration 237: 0 [2026-03-25 17:46:43,290][__main__][INFO] - agents played in iteration 237 are Bob, Alice [2026-03-25 17:46:44,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:46:44,291][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:46:44,291][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:46:44,292][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:46:45,003][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:46:45,619][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:46:46,277][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:46:46,933][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:46:47,591][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:46:48,248][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:46:48,906][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:46:49,564][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:46:50,222][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:46:50,879][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:46:51,539][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:46:52,196][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:46:52,854][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:46:53,511][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:46:54,168][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:46:54,825][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:46:55,483][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:46:56,141][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:46:56,798][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:46:57,455][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:46:58,112][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:46:58,770][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:46:59,428][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:47:00,086][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:47:00,743][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:47:01,400][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:47:02,059][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:47:02,716][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:47:03,373][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:47:04,031][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:47:04,687][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:47:05,345][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:47:06,003][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:47:06,661][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:47:07,318][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:47:07,975][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:47:08,634][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:47:09,291][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:47:09,948][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:47:10,605][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:47:11,262][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:47:11,919][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:47:12,576][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:47:13,233][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:47:13,891][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:47:14,548][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:47:15,206][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:47:15,863][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:47:16,849][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:47:17,507][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:47:18,164][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:47:18,823][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:47:19,481][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:47:20,138][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:47:20,796][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:47:21,454][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:47:22,111][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:47:22,769][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:47:23,427][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:47:24,084][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:47:24,741][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:47:25,399][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:47:26,056][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:47:26,713][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:47:27,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:47:28,184][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:47:30,871][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:47:30,874][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:47:30,875][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:47:32,473][__main__][INFO] - Iteration 238 took 55s (10.93% Gen, 86.17% Train). Generation: 6s, Training: 47s. Estimated remaining time: 11h 38m 55s. Estimated total time: 15h 20m 22s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 2s, 500 more iterations: 7h 40m 11s. [2026-03-25 17:47:32,476][__main__][INFO] - Starting iteration 238. [2026-03-25 17:47:32,480][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:47:32,481][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:47:37,288][__main__][INFO] - Number of regex retries in iteration 238: 0 [2026-03-25 17:47:37,289][__main__][INFO] - agents played in iteration 238 are Bob, Alice [2026-03-25 17:47:37,837][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:47:37,899][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:47:37,899][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:47:37,900][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:47:38,746][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:47:39,368][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:47:40,027][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:47:40,685][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:47:41,343][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:47:42,004][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:47:42,661][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:47:43,319][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:47:43,976][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:47:44,633][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:47:45,293][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:47:45,949][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:47:46,608][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:47:47,265][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:47:47,922][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:47:48,580][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:47:49,238][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:47:49,895][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:47:50,554][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:47:51,211][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:47:51,869][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:47:52,526][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:47:53,183][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:47:53,840][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:47:54,497][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:47:55,154][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:47:55,812][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:47:56,470][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:47:57,127][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:47:57,786][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:47:58,444][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:47:59,101][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:47:59,759][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:48:00,415][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:48:01,073][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:48:01,730][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:48:02,387][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:48:03,044][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:48:03,701][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:48:04,358][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:48:05,015][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:48:05,673][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:48:06,332][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:48:06,989][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:48:07,647][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:48:08,306][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:48:08,963][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:48:09,621][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:48:10,612][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:48:11,269][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:48:11,929][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:48:12,587][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:48:13,244][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:48:13,901][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:48:14,560][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:48:15,219][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:48:15,878][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:48:16,536][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:48:17,195][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:48:17,853][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:48:18,511][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:48:19,168][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:48:19,827][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:48:20,484][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:48:21,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:48:22,016][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:48:23,412][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:48:23,415][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:48:23,416][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:48:24,770][__main__][INFO] - Iteration 239 took 52s (9.19% Gen, 88.21% Train). Generation: 4s, Training: 46s. Estimated remaining time: 10h 49m 12s. Estimated total time: 14h 31m 32s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 9s, 500 more iterations: 7h 15m 46s. [2026-03-25 17:48:24,772][__main__][INFO] - Starting iteration 239. [2026-03-25 17:48:24,776][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:48:24,777][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:48:34,203][__main__][INFO] - Number of regex retries in iteration 239: 0 [2026-03-25 17:48:34,204][__main__][INFO] - agents played in iteration 239 are Bob, Alice [2026-03-25 17:48:34,789][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:48:34,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:48:34,851][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:48:34,851][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:48:35,627][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:48:36,245][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:48:36,904][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:48:37,561][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:48:38,218][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:48:38,876][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:48:39,535][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:48:40,192][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:48:40,850][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:48:41,507][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:48:42,165][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:48:42,824][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:48:43,482][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:48:44,140][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:48:44,800][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:48:45,458][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:48:46,115][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:48:46,773][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:48:47,434][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:48:48,092][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:48:48,749][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:48:49,408][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:48:50,066][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:48:50,725][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:48:51,384][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:48:52,043][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:48:52,701][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:48:53,359][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:48:54,017][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:48:54,675][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:48:55,333][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:48:55,992][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:48:56,651][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:48:57,311][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:48:57,971][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:48:58,631][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:48:59,290][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:48:59,948][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:49:00,606][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:49:01,268][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:49:01,927][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:49:02,586][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:49:03,243][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:49:03,902][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:49:04,559][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:49:05,217][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:49:05,875][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:49:06,533][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:49:07,530][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:49:08,188][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:49:08,846][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:49:09,504][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:49:10,162][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:49:10,819][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:49:11,479][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:49:12,136][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:49:12,794][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:49:13,452][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:49:14,109][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:49:14,767][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:49:15,424][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:49:16,081][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:49:16,738][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:49:17,398][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:49:18,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:49:18,888][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:49:20,367][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:49:20,371][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:49:20,381][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:49:21,915][__main__][INFO] - Iteration 240 took 57s (16.50% Gen, 80.81% Train). Generation: 9s, Training: 46s. Estimated remaining time: 12h 9m 4s. Estimated total time: 15h 52m 20s. Time estimates for 10 more iterations: 9m 31s, 100 more iterations: 1h 35m 14s, 500 more iterations: 7h 56m 10s. [2026-03-25 17:49:21,918][__main__][INFO] - Starting iteration 240. [2026-03-25 17:49:21,923][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:49:21,924][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:49:29,958][__main__][INFO] - Number of regex retries in iteration 240: 0 [2026-03-25 17:49:29,959][__main__][INFO] - agents played in iteration 240 are Bob, Alice [2026-03-25 17:49:30,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:49:30,593][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:49:30,594][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:49:30,594][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:49:31,273][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:49:31,884][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:49:32,542][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:49:33,200][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:49:33,858][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:49:34,515][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:49:35,173][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:49:35,830][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:49:36,487][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:49:37,146][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:49:37,803][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:49:38,463][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:49:39,120][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:49:39,778][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:49:40,435][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:49:41,092][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:49:41,750][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:49:42,408][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:49:43,065][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:49:43,722][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:49:44,379][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:49:45,036][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:49:45,693][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:49:46,351][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:49:47,009][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:49:47,666][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:49:48,325][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:49:48,983][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:49:49,640][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:49:50,296][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:49:50,954][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:49:51,610][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:49:52,267][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:49:52,924][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:49:53,581][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:49:54,239][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:49:54,897][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:49:55,554][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:49:56,211][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:49:56,868][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:49:57,527][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:49:58,184][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:49:58,843][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:49:59,500][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:50:00,157][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:50:00,815][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:50:01,472][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:50:02,129][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:50:03,111][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:50:03,772][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:50:04,431][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:50:05,089][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:50:05,747][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:50:06,405][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:50:07,063][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:50:07,721][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:50:08,379][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:50:09,036][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:50:09,694][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:50:10,352][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:50:11,009][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:50:11,666][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:50:12,324][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:50:12,981][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:50:13,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:50:14,379][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:50:15,817][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:50:15,820][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:50:15,832][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:50:17,297][__main__][INFO] - Iteration 241 took 55s (14.47% Gen, 82.84% Train). Generation: 8s, Training: 45s. Estimated remaining time: 11h 38m 45s. Estimated total time: 15h 22m 57s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 17s, 500 more iterations: 7h 41m 28s. [2026-03-25 17:50:17,300][__main__][INFO] - Starting iteration 241. [2026-03-25 17:50:17,304][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:50:17,304][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:50:24,683][__main__][INFO] - Number of regex retries in iteration 241: 0 [2026-03-25 17:50:24,684][__main__][INFO] - agents played in iteration 241 are Bob, Alice [2026-03-25 17:50:25,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:50:25,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:50:25,578][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:50:25,579][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:50:26,387][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:50:26,995][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:50:27,653][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:50:28,310][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:50:28,968][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:50:29,625][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:50:30,282][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:50:30,941][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:50:31,599][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:50:32,257][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:50:32,917][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:50:33,574][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:50:34,231][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:50:34,889][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:50:35,545][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:50:36,202][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:50:36,860][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:50:37,518][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:50:38,175][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:50:38,832][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:50:39,489][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:50:40,146][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:50:40,805][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:50:41,462][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:50:42,119][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:50:42,776][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:50:43,433][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:50:44,090][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:50:44,748][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:50:45,405][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:50:46,063][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:50:46,720][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:50:47,378][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:50:48,035][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:50:48,692][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:50:49,349][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:50:50,006][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:50:50,663][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:50:51,320][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:50:51,976][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:50:52,634][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:50:53,292][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:50:53,950][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:50:54,607][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:50:55,263][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:50:55,921][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:50:56,579][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:50:57,236][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:50:58,227][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:50:58,885][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:50:59,543][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:51:00,201][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:51:00,862][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:51:01,519][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:51:02,176][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:51:02,838][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:51:03,496][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:51:04,154][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:51:04,811][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:51:05,469][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:51:06,126][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:51:06,783][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:51:07,442][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:51:08,099][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:51:08,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:51:09,550][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:51:10,895][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:51:10,898][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:51:10,900][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:51:12,425][__main__][INFO] - Iteration 242 took 55s (13.39% Gen, 83.84% Train). Generation: 7s, Training: 46s. Estimated remaining time: 11h 33m 36s. Estimated total time: 15h 18m 43s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 52s, 500 more iterations: 7h 39m 21s. [2026-03-25 17:51:12,427][__main__][INFO] - Starting iteration 242. [2026-03-25 17:51:12,433][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:51:12,433][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:51:18,548][__main__][INFO] - Number of regex retries in iteration 242: 0 [2026-03-25 17:51:18,548][__main__][INFO] - agents played in iteration 242 are Bob, Alice [2026-03-25 17:51:19,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:51:19,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:51:19,645][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:51:19,646][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:51:20,313][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:51:20,925][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:51:21,584][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:51:22,241][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:51:22,898][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:51:23,555][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:51:24,213][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:51:24,872][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:51:25,529][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:51:26,188][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:51:26,846][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:51:27,504][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:51:28,162][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:51:28,822][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:51:29,479][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:51:30,138][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:51:30,796][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:51:31,454][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:51:32,111][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:51:32,768][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:51:33,426][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:51:34,088][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:51:34,744][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:51:35,404][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:51:36,062][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:51:36,720][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:51:37,377][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:51:38,034][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:51:38,691][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:51:39,348][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:51:40,005][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:51:40,664][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:51:41,322][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:51:41,979][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:51:42,637][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:51:43,296][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:51:43,953][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:51:44,611][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:51:45,268][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:51:45,925][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:51:46,582][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:51:47,239][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:51:47,897][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:51:48,553][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:51:49,211][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:51:49,868][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:51:50,525][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:51:51,183][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:51:52,176][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:51:52,835][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:51:53,492][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:51:54,151][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:51:54,809][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:51:55,467][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:51:56,125][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:51:56,783][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:51:57,441][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:51:58,101][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:51:58,760][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:51:59,418][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:52:00,075][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:52:00,733][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:52:01,392][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:52:02,049][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:52:02,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:52:03,575][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:52:05,267][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:52:05,270][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:52:05,271][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:52:06,608][__main__][INFO] - Iteration 243 took 54s (11.29% Gen, 86.24% Train). Generation: 6s, Training: 46s. Estimated remaining time: 11h 16m 55s. Estimated total time: 15h 2m 57s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 17s, 500 more iterations: 7h 31m 28s. [2026-03-25 17:52:06,611][__main__][INFO] - Starting iteration 243. [2026-03-25 17:52:06,617][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:52:06,618][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:52:24,098][__main__][INFO] - Number of regex retries in iteration 243: 0 [2026-03-25 17:52:24,099][__main__][INFO] - agents played in iteration 243 are Bob, Alice [2026-03-25 17:52:24,558][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:52:24,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:52:24,620][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:52:24,620][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:52:25,329][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:52:25,954][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:52:26,612][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:52:27,270][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:52:27,927][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:52:28,585][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:52:29,243][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:52:29,902][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:52:30,559][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:52:31,215][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:52:31,875][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:52:32,531][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:52:33,187][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:52:33,844][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:52:34,500][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:52:35,158][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:52:35,814][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:52:36,472][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:52:37,129][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:52:37,785][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:52:38,442][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:52:39,098][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:52:39,756][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:52:40,413][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:52:41,070][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:52:41,727][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:52:42,385][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:52:43,043][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:52:43,700][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:52:44,357][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:52:45,014][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:52:45,671][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:52:46,329][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:52:46,986][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:52:47,644][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:52:48,302][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:52:48,960][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:52:49,617][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:52:50,274][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:52:50,931][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:52:51,588][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:52:52,247][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:52:52,905][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:52:53,562][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:52:54,219][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:52:54,876][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:52:55,533][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:52:56,190][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:52:57,176][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:52:57,836][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:52:58,493][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:52:59,152][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:52:59,809][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:53:00,467][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:53:01,127][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:53:01,784][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:53:02,442][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:53:03,100][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:53:03,757][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:53:04,414][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:53:05,071][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:53:05,729][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:53:06,386][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:53:07,043][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:53:07,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:53:08,552][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:53:09,896][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:53:09,899][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:53:09,900][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:53:11,381][__main__][INFO] - Iteration 244 took 1m 4s (26.99% Gen, 70.72% Train). Generation: 17s, Training: 45s. Estimated remaining time: 14h 12m 20s. Estimated total time: 17h 59m 26s. Time estimates for 10 more iterations: 10m 47s, 100 more iterations: 1h 47m 56s, 500 more iterations: 8h 59m 43s. [2026-03-25 17:53:11,384][__main__][INFO] - Starting iteration 244. [2026-03-25 17:53:11,388][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:53:11,388][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:53:25,822][__main__][INFO] - Number of regex retries in iteration 244: 0 [2026-03-25 17:53:25,823][__main__][INFO] - agents played in iteration 244 are Bob, Alice [2026-03-25 17:53:26,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:53:26,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:53:26,452][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:53:26,453][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:53:27,288][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:53:27,907][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:53:28,557][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:53:29,214][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:53:29,870][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:53:30,526][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:53:31,185][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:53:31,842][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:53:32,500][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:53:33,158][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:53:33,814][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:53:34,471][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:53:35,129][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:53:35,786][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:53:36,443][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:53:37,100][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:53:37,759][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:53:38,417][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:53:39,074][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:53:39,731][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:53:40,387][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:53:41,044][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:53:41,700][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:53:42,358][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:53:43,016][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:53:43,674][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:53:44,332][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:53:44,990][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:53:45,646][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:53:46,304][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:53:46,961][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:53:47,618][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:53:48,276][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:53:48,933][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:53:49,590][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:53:50,247][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:53:50,904][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:53:51,561][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:53:52,218][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:53:52,875][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:53:53,532][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:53:54,189][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:53:54,846][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:53:55,503][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:53:56,161][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:53:56,825][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:53:57,479][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:53:58,139][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:53:59,147][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:53:59,807][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:54:00,465][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:54:01,122][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:54:01,779][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:54:02,436][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:54:03,093][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:54:03,751][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:54:04,408][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:54:05,065][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:54:05,723][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:54:06,383][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:54:07,039][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:54:07,696][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:54:08,355][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:54:09,014][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:54:09,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:54:10,569][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:54:11,920][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:54:11,923][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:54:11,924][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:54:13,254][__main__][INFO] - Iteration 245 took 1m 1s (23.33% Gen, 74.52% Train). Generation: 14s, Training: 46s. Estimated remaining time: 13h 23m 0s. Estimated total time: 17h 11m 8s. Time estimates for 10 more iterations: 10m 18s, 100 more iterations: 1h 43m 6s, 500 more iterations: 8h 35m 34s. [2026-03-25 17:54:13,256][__main__][INFO] - Starting iteration 245. [2026-03-25 17:54:13,260][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:54:13,261][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:54:20,293][__main__][INFO] - Number of regex retries in iteration 245: 0 [2026-03-25 17:54:20,295][__main__][INFO] - agents played in iteration 245 are Bob, Alice [2026-03-25 17:54:21,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:54:21,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:54:21,199][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:54:21,200][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:54:21,990][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:54:22,608][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:54:23,258][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:54:23,915][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:54:24,572][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:54:25,230][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:54:25,886][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:54:26,543][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:54:27,199][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:54:27,856][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:54:28,513][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:54:29,171][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:54:29,827][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:54:30,485][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:54:31,145][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:54:31,804][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:54:32,462][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:54:33,119][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:54:33,778][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:54:34,439][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:54:35,096][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:54:35,753][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:54:36,408][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:54:37,066][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:54:37,724][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:54:38,382][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:54:39,041][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:54:39,699][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:54:40,354][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:54:41,013][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:54:41,672][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:54:42,330][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:54:42,989][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:54:43,646][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:54:44,304][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:54:44,962][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:54:45,621][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:54:46,279][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:54:46,937][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:54:47,596][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:54:48,253][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:54:48,911][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:54:49,566][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:54:50,225][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:54:50,882][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:54:51,541][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:54:52,199][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:54:52,856][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:54:53,856][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:54:54,515][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:54:55,174][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:54:55,832][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:54:56,492][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:54:57,149][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:54:57,808][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:54:58,467][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:54:59,124][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:54:59,782][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:55:00,440][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:55:01,099][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:55:01,758][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:55:02,416][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:55:03,073][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:55:03,733][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:55:04,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:55:05,305][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:55:06,688][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:55:06,692][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:55:06,693][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:55:08,036][__main__][INFO] - Iteration 246 took 54s (12.84% Gen, 84.70% Train). Generation: 7s, Training: 46s. Estimated remaining time: 11h 23m 55s. Estimated total time: 15h 12m 58s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 17s, 500 more iterations: 7h 36m 29s. [2026-03-25 17:55:08,039][__main__][INFO] - Starting iteration 246. [2026-03-25 17:55:08,043][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:55:08,044][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:55:17,256][__main__][INFO] - Number of regex retries in iteration 246: 0 [2026-03-25 17:55:17,257][__main__][INFO] - agents played in iteration 246 are Bob, Alice [2026-03-25 17:55:18,268][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:55:18,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:55:18,331][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:55:18,331][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:55:19,002][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:55:19,619][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:55:20,278][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:55:20,941][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:55:21,598][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:55:22,255][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:55:22,913][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:55:23,573][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:55:24,231][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:55:24,889][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:55:25,548][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:55:26,205][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:55:26,863][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:55:27,520][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:55:28,178][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:55:28,839][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:55:29,496][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:55:30,155][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:55:30,813][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:55:31,470][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:55:32,127][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:55:32,785][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:55:33,443][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:55:34,100][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:55:34,757][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:55:35,415][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:55:36,074][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:55:36,732][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:55:37,391][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:55:38,050][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:55:38,708][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:55:39,366][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:55:40,025][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:55:40,683][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:55:41,343][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:55:42,003][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:55:42,661][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:55:43,321][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:55:43,979][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:55:44,636][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:55:45,294][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:55:45,951][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:55:46,608][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:55:47,265][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:55:47,922][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:55:48,579][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:55:49,236][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:55:49,894][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:55:50,883][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:55:51,541][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:55:52,198][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:55:52,857][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:55:53,514][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:55:54,171][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:55:54,829][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:55:55,488][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:55:56,146][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:55:56,803][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:55:57,460][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:55:58,118][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:55:58,775][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:55:59,432][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:56:00,090][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:56:00,746][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:56:01,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:56:02,199][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:56:03,945][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:56:03,949][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:56:03,950][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:56:05,495][__main__][INFO] - Iteration 247 took 57s (16.04% Gen, 81.27% Train). Generation: 9s, Training: 46s. Estimated remaining time: 12h 7m 34s. Estimated total time: 15h 57m 34s. Time estimates for 10 more iterations: 9m 34s, 100 more iterations: 1h 35m 45s, 500 more iterations: 7h 58m 47s. [2026-03-25 17:56:05,499][__main__][INFO] - Starting iteration 247. [2026-03-25 17:56:05,504][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:56:05,505][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:56:13,611][__main__][INFO] - Number of regex retries in iteration 247: 0 [2026-03-25 17:56:13,613][__main__][INFO] - agents played in iteration 247 are Bob, Alice [2026-03-25 17:56:14,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:56:14,139][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:56:14,140][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:56:14,140][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:56:14,919][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:56:15,528][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:56:16,185][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:56:16,843][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:56:17,501][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:56:18,159][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:56:18,816][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:56:19,475][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:56:20,132][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:56:20,790][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:56:21,448][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:56:22,107][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:56:22,765][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:56:23,422][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:56:24,085][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:56:24,747][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:56:25,411][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:56:26,078][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:56:26,742][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:56:27,404][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:56:28,063][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:56:28,725][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:56:29,383][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:56:30,042][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:56:30,701][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:56:31,360][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:56:32,020][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:56:32,678][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:56:33,337][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:56:33,996][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:56:34,655][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:56:35,313][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:56:35,973][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:56:36,630][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:56:37,287][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:56:37,944][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:56:38,602][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:56:39,260][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:56:39,919][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:56:40,576][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:56:41,234][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:56:41,891][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:56:42,549][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:56:43,209][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:56:43,871][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:56:44,531][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:56:45,192][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:56:45,854][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:56:46,864][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:56:47,527][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:56:48,185][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:56:48,845][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:56:49,506][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:56:50,168][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:56:50,831][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:56:51,492][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:56:52,153][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:56:52,811][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:56:53,469][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:56:54,129][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:56:54,788][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:56:55,447][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:56:56,106][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:56:56,764][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:56:57,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:56:58,207][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:56:59,624][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:56:59,626][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:56:59,627][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:57:01,020][__main__][INFO] - Iteration 248 took 55s (14.60% Gen, 82.88% Train). Generation: 8s, Training: 46s. Estimated remaining time: 11h 34m 22s. Estimated total time: 15h 25m 18s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 31s, 500 more iterations: 7h 42m 39s. [2026-03-25 17:57:01,022][__main__][INFO] - Starting iteration 248. [2026-03-25 17:57:01,026][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:57:01,026][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:57:05,836][__main__][INFO] - Number of regex retries in iteration 248: 0 [2026-03-25 17:57:05,837][__main__][INFO] - agents played in iteration 248 are Bob, Alice [2026-03-25 17:57:06,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:57:06,359][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:57:06,359][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:57:06,360][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:57:07,242][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:57:07,861][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:57:08,518][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:57:09,179][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:57:09,838][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:57:10,497][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:57:11,155][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:57:11,814][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:57:12,472][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:57:13,129][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:57:13,787][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:57:14,445][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:57:15,103][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:57:15,762][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:57:16,422][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:57:17,083][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:57:17,742][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:57:18,399][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:57:19,060][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:57:19,722][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:57:20,383][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:57:21,042][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:57:21,700][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:57:22,358][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:57:23,018][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:57:23,681][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:57:24,343][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:57:25,004][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:57:25,665][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:57:26,326][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:57:26,987][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:57:27,648][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:57:28,311][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:57:28,971][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:57:29,636][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:57:30,300][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:57:30,963][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:57:31,625][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:57:32,283][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:57:32,944][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:57:33,606][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:57:34,267][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:57:34,929][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:57:35,591][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:57:36,252][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:57:36,913][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:57:37,575][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:57:38,237][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:57:39,252][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:57:39,914][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:57:40,575][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:57:41,238][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:57:41,898][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:57:42,559][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:57:43,218][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:57:43,877][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:57:44,536][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:57:45,194][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:57:45,854][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:57:46,513][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:57:47,171][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:57:47,831][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:57:48,491][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:57:49,152][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:57:49,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:57:50,642][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:57:51,994][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:57:51,997][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:57:51,998][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:57:53,401][__main__][INFO] - Iteration 249 took 52s (9.18% Gen, 88.13% Train). Generation: 4s, Training: 46s. Estimated remaining time: 10h 41m 9s. Estimated total time: 14h 32m 57s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 17s, 500 more iterations: 7h 16m 28s. [2026-03-25 17:57:53,403][__main__][INFO] - Starting iteration 249. [2026-03-25 17:57:53,407][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:57:53,407][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:58:05,917][__main__][INFO] - Number of regex retries in iteration 249: 0 [2026-03-25 17:58:05,918][__main__][INFO] - agents played in iteration 249 are Bob, Alice [2026-03-25 17:58:06,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:58:06,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:58:06,458][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:58:06,459][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:58:07,151][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:58:07,759][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:58:08,420][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:58:09,083][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:58:09,745][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:58:10,405][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:58:11,068][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:58:11,729][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:58:12,389][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:58:13,048][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:58:13,709][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:58:14,370][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:58:15,031][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:58:15,691][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:58:16,353][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:58:17,012][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:58:17,672][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:58:18,334][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:58:18,995][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:58:19,656][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:58:20,318][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:58:20,979][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:58:21,640][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:58:22,301][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:58:22,961][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:58:23,621][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:58:24,282][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:58:24,943][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:58:25,602][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:58:26,263][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:58:26,925][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:58:27,587][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:58:28,248][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:58:28,907][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:58:29,566][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:58:30,225][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:58:30,885][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:58:31,544][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:58:32,202][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:58:32,860][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:58:33,520][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:58:34,180][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:58:34,838][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:58:35,497][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:58:36,155][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:58:36,817][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:58:37,476][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:58:38,136][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:58:39,119][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:58:39,778][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:58:40,438][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:58:41,098][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:58:41,757][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:58:42,416][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:58:43,073][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:58:43,731][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:58:44,391][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:58:45,052][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:58:45,711][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:58:46,370][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:58:47,029][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:58:47,688][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:58:48,347][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:58:49,011][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:58:49,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:58:50,438][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:58:51,827][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:58:51,830][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:58:51,831][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:58:53,300][__main__][INFO] - Iteration 250 took 59s (20.89% Gen, 76.66% Train). Generation: 12s, Training: 45s. Estimated remaining time: 12h 45m 27s. Estimated total time: 16h 38m 15s. Time estimates for 10 more iterations: 9m 58s, 100 more iterations: 1h 39m 49s, 500 more iterations: 8h 19m 7s. [2026-03-25 17:58:53,302][__main__][INFO] - Starting iteration 250. [2026-03-25 17:58:53,307][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:58:53,307][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:58:58,975][__main__][INFO] - Number of regex retries in iteration 250: 0 [2026-03-25 17:58:58,976][__main__][INFO] - agents played in iteration 250 are Bob, Alice [2026-03-25 17:58:59,979][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:59:00,042][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 17:59:00,043][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:59:00,044][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:59:00,775][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:59:01,395][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:59:02,057][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:59:02,718][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:59:03,379][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:59:04,041][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:59:04,702][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:59:05,365][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:59:06,025][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:59:06,687][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:59:07,348][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:59:08,010][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:59:08,671][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:59:09,332][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:59:09,993][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:59:10,654][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:59:11,315][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:59:11,977][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:59:12,637][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:59:13,299][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:59:13,958][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:59:14,618][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:59:15,279][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:59:15,938][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:59:16,598][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:59:17,259][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:59:17,920][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:59:18,579][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:59:19,241][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:59:19,900][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:59:20,561][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:59:21,221][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:59:21,880][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:59:22,538][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:59:23,198][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:59:23,858][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:59:24,518][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:59:25,179][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:59:25,839][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:59:26,502][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:59:27,163][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:59:27,823][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:59:28,488][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:59:29,152][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:59:29,811][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:59:30,472][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:59:31,135][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:59:31,798][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:59:32,802][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:59:33,464][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:59:34,125][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:59:34,785][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:59:35,446][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:59:36,106][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:59:36,767][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:59:37,428][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:59:38,089][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:59:38,750][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:59:39,411][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:59:40,075][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:59:40,738][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:59:41,401][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:59:42,063][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:59:42,726][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:59:43,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:59:44,302][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 17:59:45,729][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:59:45,732][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:59:45,733][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:59:50,792][__main__][INFO] - Iteration 251 took 57s (9.86% Gen, 81.33% Train). Generation: 5s, Training: 46s. Estimated remaining time: 12h 4m 22s. Estimated total time: 15h 58m 7s. Time estimates for 10 more iterations: 9m 34s, 100 more iterations: 1h 35m 48s, 500 more iterations: 7h 59m 3s. [2026-03-25 17:59:50,794][__main__][INFO] - Starting iteration 251. [2026-03-25 17:59:50,799][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 17:59:50,799][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:00:03,265][__main__][INFO] - Number of regex retries in iteration 251: 0 [2026-03-25 18:00:03,266][__main__][INFO] - agents played in iteration 251 are Bob, Alice [2026-03-25 18:00:03,849][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:00:03,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:00:03,913][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:00:03,914][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:00:04,563][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:00:05,179][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:00:05,840][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:00:06,502][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:00:07,165][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:00:07,823][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:00:08,484][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:00:09,143][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:00:09,804][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:00:10,465][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:00:11,126][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:00:11,786][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:00:12,451][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:00:13,113][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:00:13,774][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:00:14,434][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:00:15,095][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:00:15,755][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:00:16,414][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:00:17,074][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:00:17,733][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:00:18,394][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:00:19,055][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:00:19,716][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:00:20,377][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:00:21,037][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:00:21,701][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:00:22,360][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:00:23,021][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:00:23,682][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:00:24,345][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:00:25,007][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:00:25,669][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:00:26,330][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:00:26,991][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:00:27,654][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:00:28,315][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:00:28,976][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:00:29,637][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:00:30,299][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:00:30,961][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:00:31,621][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:00:32,283][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:00:32,942][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:00:33,604][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:00:34,265][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:00:34,926][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:00:35,587][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:00:36,605][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:00:37,266][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:00:37,927][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:00:38,587][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:00:39,247][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:00:39,906][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:00:40,568][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:00:41,231][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:00:41,893][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:00:42,555][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:00:43,217][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:00:43,879][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:00:44,541][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:00:45,202][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:00:45,865][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:00:46,527][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:00:47,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:00:48,008][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:00:49,401][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:00:49,404][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:00:49,405][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:00:51,065][__main__][INFO] - Iteration 252 took 1m 0s (20.69% Gen, 76.56% Train). Generation: 12s, Training: 46s. Estimated remaining time: 12h 49m 42s. Estimated total time: 16h 44m 27s. Time estimates for 10 more iterations: 10m 2s, 100 more iterations: 1h 40m 26s, 500 more iterations: 8h 22m 13s. [2026-03-25 18:00:51,067][__main__][INFO] - Starting iteration 252. [2026-03-25 18:00:51,072][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:00:51,073][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:00:57,158][__main__][INFO] - Number of regex retries in iteration 252: 0 [2026-03-25 18:00:57,159][__main__][INFO] - agents played in iteration 252 are Bob, Alice [2026-03-25 18:00:57,634][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:00:57,702][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:00:57,703][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:00:57,704][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:00:58,436][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:00:59,041][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:00:59,706][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:01:00,369][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:01:01,031][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:01:01,695][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:01:02,356][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:01:03,020][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:01:03,682][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:01:04,345][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:01:05,008][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:01:05,670][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:01:06,333][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:01:06,997][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:01:07,660][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:01:08,322][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:01:08,986][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:01:09,648][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:01:10,312][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:01:10,976][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:01:11,638][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:01:12,300][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:01:12,963][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:01:13,627][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:01:14,288][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:01:14,951][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:01:15,614][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:01:16,277][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:01:16,940][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:01:17,604][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:01:18,269][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:01:18,932][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:01:19,596][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:01:20,260][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:01:20,923][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:01:21,585][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:01:22,249][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:01:22,911][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:01:23,573][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:01:24,236][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:01:24,899][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:01:25,562][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:01:26,224][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:01:26,885][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:01:27,548][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:01:28,210][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:01:28,874][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:01:29,538][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:01:30,542][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:01:31,204][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:01:31,867][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:01:32,530][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:01:33,193][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:01:33,855][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:01:34,518][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:01:35,180][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:01:35,842][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:01:36,504][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:01:37,167][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:01:37,829][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:01:38,491][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:01:39,152][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:01:39,815][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:01:40,478][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:01:41,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:01:41,923][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:01:43,338][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:01:43,341][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:01:43,342][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:01:44,768][__main__][INFO] - Iteration 253 took 53s (11.33% Gen, 86.00% Train). Generation: 6s, Training: 46s. Estimated remaining time: 10h 59m 19s. Estimated total time: 14h 54m 59s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 29s, 500 more iterations: 7h 27m 29s. [2026-03-25 18:01:44,771][__main__][INFO] - Starting iteration 253. [2026-03-25 18:01:44,775][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:01:44,775][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:01:50,073][__main__][INFO] - Number of regex retries in iteration 253: 0 [2026-03-25 18:01:50,075][__main__][INFO] - agents played in iteration 253 are Bob, Alice [2026-03-25 18:01:50,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:01:50,919][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:01:50,921][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:01:50,923][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:01:51,719][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:01:52,331][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:01:52,997][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:01:53,660][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:01:54,322][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:01:54,986][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:01:55,647][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:01:56,308][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:01:56,972][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:01:57,634][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:01:58,298][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:01:58,960][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:01:59,628][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:02:00,291][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:02:00,952][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:02:01,613][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:02:02,275][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:02:02,937][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:02:03,599][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:02:04,264][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:02:04,924][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:02:05,585][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:02:06,247][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:02:06,909][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:02:07,572][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:02:08,231][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:02:08,892][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:02:09,555][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:02:10,216][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:02:10,878][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:02:11,543][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:02:12,204][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:02:12,865][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:02:13,529][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:02:14,193][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:02:14,851][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:02:15,510][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:02:16,168][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:02:16,826][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:02:17,485][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:02:18,145][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:02:18,806][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:02:19,467][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:02:20,128][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:02:20,790][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:02:21,452][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:02:22,117][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:02:22,779][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:02:23,823][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:02:24,485][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:02:25,144][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:02:25,804][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:02:26,463][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:02:27,123][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:02:27,786][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:02:28,448][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:02:29,108][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:02:29,768][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:02:30,429][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:02:31,096][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:02:31,755][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:02:32,418][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:02:33,078][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:02:33,737][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:02:34,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:02:35,334][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:02:36,720][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:02:36,723][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:02:36,724][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:02:38,228][__main__][INFO] - Iteration 254 took 53s (9.91% Gen, 87.27% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 54m 22s. Estimated total time: 14h 50m 55s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 5s, 500 more iterations: 7h 25m 27s. [2026-03-25 18:02:38,231][__main__][INFO] - Starting iteration 254. [2026-03-25 18:02:38,235][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:02:38,236][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:02:44,105][__main__][INFO] - Number of regex retries in iteration 254: 0 [2026-03-25 18:02:44,107][__main__][INFO] - agents played in iteration 254 are Bob, Alice [2026-03-25 18:02:44,748][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:02:44,811][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:02:44,811][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:02:44,812][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:02:45,480][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:02:46,100][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:02:46,763][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:02:47,426][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:02:48,087][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:02:48,747][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:02:49,409][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:02:50,071][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:02:50,734][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:02:51,396][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:02:52,056][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:02:52,715][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:02:53,375][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:02:54,035][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:02:54,697][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:02:55,356][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:02:56,016][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:02:56,674][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:02:57,334][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:02:57,994][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:02:58,656][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:02:59,317][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:02:59,976][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:03:00,634][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:03:01,292][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:03:01,950][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:03:02,609][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:03:03,270][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:03:03,932][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:03:04,592][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:03:05,254][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:03:05,915][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:03:06,577][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:03:07,239][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:03:07,901][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:03:08,559][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:03:09,219][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:03:09,879][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:03:10,540][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:03:11,203][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:03:11,864][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:03:12,524][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:03:13,183][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:03:14,177][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:03:14,839][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:03:15,503][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:03:16,163][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:03:16,825][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:03:17,488][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:03:18,151][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:03:18,814][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:03:19,478][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:03:20,140][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:03:20,804][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:03:21,467][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:03:22,196][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:03:22,867][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:03:23,526][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:03:24,189][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:03:24,850][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:03:25,523][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:03:26,198][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:03:26,856][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:03:27,517][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:03:28,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:03:28,963][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:03:30,339][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:03:30,343][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:03:30,344][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:03:31,828][__main__][INFO] - Iteration 255 took 53s (10.95% Gen, 86.27% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 55m 48s. Estimated total time: 14h 53m 15s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 19s, 500 more iterations: 7h 26m 37s. [2026-03-25 18:03:31,837][__main__][INFO] - Starting iteration 255. [2026-03-25 18:03:31,851][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:03:31,851][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:03:37,816][__main__][INFO] - Number of regex retries in iteration 255: 0 [2026-03-25 18:03:37,817][__main__][INFO] - agents played in iteration 255 are Bob, Alice [2026-03-25 18:03:38,718][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:03:38,782][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:03:38,783][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:03:38,783][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:03:39,482][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:03:40,101][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:03:40,759][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:03:41,422][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:03:42,083][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:03:42,746][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:03:43,409][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:03:44,072][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:03:44,734][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:03:45,395][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:03:46,056][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:03:46,718][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:03:47,379][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:03:48,041][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:03:48,702][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:03:49,361][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:03:50,020][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:03:50,682][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:03:51,343][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:03:52,005][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:03:52,665][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:03:53,325][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:03:53,985][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:03:54,646][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:03:55,307][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:03:55,968][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:03:56,630][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:03:57,293][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:03:57,955][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:03:58,618][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:03:59,282][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:03:59,946][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:04:00,607][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:04:01,270][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:04:01,934][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:04:02,597][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:04:03,260][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:04:03,923][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:04:04,584][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:04:05,244][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:04:05,905][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:04:06,567][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:04:07,230][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:04:07,895][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:04:08,556][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:04:09,217][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:04:09,877][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:04:10,537][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:04:11,580][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:04:12,240][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:04:12,899][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:04:13,560][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:04:14,219][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:04:14,881][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:04:15,543][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:04:16,211][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:04:16,866][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:04:17,529][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:04:18,190][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:04:18,850][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:04:19,512][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:04:20,175][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:04:20,836][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:04:21,497][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:04:22,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:04:22,962][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:04:24,432][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:04:24,435][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:04:24,436][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:04:25,917][__main__][INFO] - Iteration 256 took 54s (11.03% Gen, 86.22% Train). Generation: 5s, Training: 46s. Estimated remaining time: 11h 2m 47s. Estimated total time: 15h 1m 8s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 6s, 500 more iterations: 7h 30m 34s. [2026-03-25 18:04:25,919][__main__][INFO] - Starting iteration 256. [2026-03-25 18:04:25,923][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:04:25,924][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:04:32,185][__main__][INFO] - Number of regex retries in iteration 256: 0 [2026-03-25 18:04:32,186][__main__][INFO] - agents played in iteration 256 are Bob, Alice [2026-03-25 18:04:33,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:04:33,491][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:04:33,492][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:04:33,492][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:04:34,159][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:04:34,776][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:04:35,436][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:04:36,095][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:04:36,754][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:04:37,417][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:04:38,077][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:04:38,736][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:04:39,395][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:04:40,054][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:04:40,714][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:04:41,374][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:04:42,036][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:04:42,696][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:04:43,355][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:04:44,016][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:04:44,676][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:04:45,335][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:04:45,993][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:04:46,652][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:04:47,311][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:04:47,970][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:04:48,631][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:04:49,291][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:04:49,952][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:04:50,611][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:04:51,272][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:04:51,932][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:04:52,592][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:04:53,251][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:04:53,910][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:04:54,569][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:04:55,229][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:04:55,888][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:04:56,546][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:04:57,204][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:04:57,862][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:04:58,523][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:04:59,183][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:04:59,846][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:05:00,506][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:05:01,165][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:05:01,829][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:05:02,494][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:05:03,158][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:05:03,819][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:05:04,479][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:05:05,141][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:05:06,128][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:05:06,788][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:05:07,447][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:05:08,106][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:05:08,765][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:05:09,423][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:05:10,084][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:05:10,744][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:05:11,405][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:05:12,066][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:05:12,725][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:05:13,387][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:05:14,046][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:05:14,706][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:05:15,370][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:05:16,031][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:05:16,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:05:17,592][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:05:18,980][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:05:18,983][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:05:18,984][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:05:20,573][__main__][INFO] - Iteration 257 took 54s (11.46% Gen, 85.63% Train). Generation: 6s, Training: 46s. Estimated remaining time: 11h 11m 36s. Estimated total time: 15h 10m 52s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 5s, 500 more iterations: 7h 35m 26s. [2026-03-25 18:05:20,576][__main__][INFO] - Starting iteration 257. [2026-03-25 18:05:20,580][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:05:20,580][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:05:25,725][__main__][INFO] - Number of regex retries in iteration 257: 0 [2026-03-25 18:05:25,726][__main__][INFO] - agents played in iteration 257 are Bob, Alice [2026-03-25 18:05:26,341][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:05:26,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:05:26,406][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:05:26,406][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:05:27,150][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:05:27,762][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:05:28,424][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:05:29,086][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:05:29,747][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:05:30,412][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:05:31,077][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:05:31,740][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:05:32,403][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:05:33,063][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:05:33,725][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:05:34,386][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:05:35,048][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:05:35,710][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:05:36,374][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:05:37,037][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:05:37,699][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:05:38,363][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:05:39,024][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:05:39,686][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:05:40,348][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:05:41,009][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:05:41,671][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:05:42,334][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:05:42,996][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:05:43,658][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:05:44,321][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:05:44,981][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:05:45,643][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:05:46,306][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:05:46,966][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:05:47,628][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:05:48,291][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:05:48,953][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:05:49,616][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:05:50,278][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:05:50,940][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:05:51,603][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:05:52,266][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:05:52,929][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:05:53,593][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:05:54,254][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:05:54,917][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:05:55,580][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:05:56,243][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:05:56,905][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:05:57,568][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:05:58,231][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:05:59,248][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:05:59,909][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:06:00,570][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:06:01,233][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:06:01,895][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:06:02,556][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:06:03,219][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:06:03,882][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:06:04,544][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:06:05,206][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:06:05,866][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:06:06,526][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:06:07,184][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:06:07,843][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:06:08,505][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:06:09,166][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:06:09,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:06:10,638][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:06:11,781][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:06:11,784][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:06:11,785][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:06:13,209][__main__][INFO] - Iteration 258 took 52s (9.78% Gen, 87.51% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 37m 3s. Estimated total time: 14h 37m 11s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 43s, 500 more iterations: 7h 18m 35s. [2026-03-25 18:06:13,211][__main__][INFO] - Starting iteration 258. [2026-03-25 18:06:13,215][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:06:13,216][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:06:18,094][__main__][INFO] - Number of regex retries in iteration 258: 0 [2026-03-25 18:06:18,095][__main__][INFO] - agents played in iteration 258 are Bob, Alice [2026-03-25 18:06:18,676][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:06:18,737][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:06:18,738][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:06:18,739][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:06:19,666][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:06:20,284][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:06:20,948][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:06:21,610][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:06:22,271][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:06:22,933][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:06:23,594][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:06:24,255][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:06:24,915][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:06:25,575][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:06:26,234][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:06:26,893][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:06:27,552][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:06:28,211][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:06:28,872][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:06:29,536][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:06:30,195][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:06:30,855][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:06:31,514][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:06:32,173][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:06:32,832][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:06:33,492][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:06:34,154][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:06:34,816][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:06:35,475][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:06:36,138][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:06:36,797][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:06:37,457][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:06:38,117][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:06:38,776][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:06:39,435][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:06:40,095][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:06:40,754][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:06:41,415][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:06:42,074][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:06:42,734][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:06:43,394][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:06:44,054][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:06:44,714][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:06:45,375][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:06:46,035][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:06:46,695][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:06:47,355][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:06:48,015][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:06:48,675][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:06:49,335][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:06:49,995][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:06:50,656][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:06:51,627][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:06:52,287][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:06:52,949][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:06:53,607][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:06:54,267][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:06:54,928][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:06:55,589][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:06:56,248][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:06:56,907][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:06:57,566][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:06:58,225][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:06:58,886][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:06:59,545][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:07:00,205][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:07:00,864][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:07:01,524][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:07:02,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:07:02,968][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:07:04,375][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:07:04,378][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:07:04,379][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:07:05,744][__main__][INFO] - Iteration 259 took 52s (9.29% Gen, 88.11% Train). Generation: 4s, Training: 46s. Estimated remaining time: 10h 34m 29s. Estimated total time: 14h 35m 30s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 33s, 500 more iterations: 7h 17m 45s. [2026-03-25 18:07:05,746][__main__][INFO] - Starting iteration 259. [2026-03-25 18:07:05,750][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:07:05,750][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:07:10,952][__main__][INFO] - Number of regex retries in iteration 259: 0 [2026-03-25 18:07:10,953][__main__][INFO] - agents played in iteration 259 are Bob, Alice [2026-03-25 18:07:11,516][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:07:11,577][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:07:11,578][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:07:11,578][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:07:12,491][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:07:13,103][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:07:13,763][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:07:14,423][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:07:15,083][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:07:15,745][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:07:16,405][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:07:17,066][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:07:17,725][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:07:18,385][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:07:19,046][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:07:19,706][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:07:20,366][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:07:21,026][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:07:21,686][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:07:22,346][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:07:23,005][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:07:23,665][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:07:24,324][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:07:24,984][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:07:25,644][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:07:26,305][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:07:26,966][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:07:27,626][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:07:28,287][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:07:28,949][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:07:29,610][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:07:30,272][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:07:30,930][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:07:31,591][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:07:32,252][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:07:32,911][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:07:33,571][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:07:34,232][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:07:34,892][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:07:35,552][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:07:36,213][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:07:36,874][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:07:37,534][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:07:38,196][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:07:38,859][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:07:39,519][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:07:40,179][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:07:40,839][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:07:41,499][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:07:42,160][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:07:42,820][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:07:43,480][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:07:44,470][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:07:45,129][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:07:45,790][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:07:46,449][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:07:47,108][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:07:47,767][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:07:48,427][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:07:49,086][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:07:49,748][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:07:50,409][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:07:51,069][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:07:51,729][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:07:52,389][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:07:53,049][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:07:53,709][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:07:54,369][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:07:55,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:07:55,812][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:07:57,216][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:07:57,219][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:07:57,220][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:07:58,644][__main__][INFO] - Iteration 260 took 52s (9.83% Gen, 87.47% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 39m 43s. Estimated total time: 14h 41m 36s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 9s, 500 more iterations: 7h 20m 48s. [2026-03-25 18:07:58,646][__main__][INFO] - Starting iteration 260. [2026-03-25 18:07:58,650][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:07:58,650][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:08:04,667][__main__][INFO] - Number of regex retries in iteration 260: 0 [2026-03-25 18:08:04,668][__main__][INFO] - agents played in iteration 260 are Bob, Alice [2026-03-25 18:08:05,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:08:05,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:08:05,325][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:08:05,325][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:08:06,004][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:08:06,624][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:08:07,285][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:08:07,947][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:08:08,608][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:08:09,271][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:08:09,933][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:08:10,594][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:08:11,253][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:08:11,913][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:08:12,573][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:08:13,233][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:08:13,893][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:08:14,553][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:08:15,213][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:08:15,873][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:08:16,533][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:08:17,193][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:08:17,852][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:08:18,513][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:08:19,174][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:08:19,833][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:08:20,493][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:08:21,152][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:08:21,812][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:08:22,473][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:08:23,132][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:08:23,792][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:08:24,452][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:08:25,111][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:08:25,771][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:08:26,430][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:08:27,092][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:08:27,751][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:08:28,411][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:08:29,071][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:08:29,733][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:08:30,395][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:08:31,055][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:08:31,715][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:08:32,375][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:08:33,034][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:08:33,697][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:08:34,358][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:08:35,018][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:08:35,680][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:08:36,342][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:08:37,005][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:08:37,989][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:08:38,649][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:08:39,308][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:08:39,968][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:08:40,628][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:08:41,289][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:08:41,952][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:08:42,616][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:08:43,277][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:08:43,942][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:08:44,606][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:08:45,270][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:08:45,933][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:08:46,596][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:08:47,260][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:08:47,923][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:08:48,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:08:49,394][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:08:50,822][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:08:50,826][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:08:50,827][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:08:54,031][__main__][INFO] - Iteration 261 took 55s (10.87% Gen, 83.34% Train). Generation: 6s, Training: 46s. Estimated remaining time: 11h 20m 14s. Estimated total time: 15h 23m 3s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 18s, 500 more iterations: 7h 41m 31s. [2026-03-25 18:08:54,033][__main__][INFO] - Starting iteration 261. [2026-03-25 18:08:54,038][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:08:54,039][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:09:01,939][__main__][INFO] - Number of regex retries in iteration 261: 0 [2026-03-25 18:09:01,940][__main__][INFO] - agents played in iteration 261 are Bob, Alice [2026-03-25 18:09:03,110][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:09:03,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:09:03,174][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:09:03,175][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:09:03,967][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:09:04,585][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:09:05,248][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:09:05,910][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:09:06,571][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:09:07,232][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:09:07,899][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:09:08,560][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:09:09,222][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:09:09,882][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:09:10,544][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:09:11,206][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:09:11,868][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:09:12,530][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:09:13,192][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:09:13,853][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:09:14,516][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:09:15,178][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:09:15,839][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:09:16,503][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:09:17,165][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:09:17,826][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:09:18,487][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:09:19,148][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:09:19,811][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:09:20,472][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:09:21,134][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:09:21,796][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:09:22,457][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:09:23,119][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:09:23,782][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:09:24,443][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:09:25,105][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:09:25,767][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:09:26,428][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:09:27,089][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:09:27,749][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:09:28,410][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:09:29,072][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:09:29,735][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:09:30,397][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:09:31,058][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:09:31,717][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:09:32,376][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:09:33,037][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:09:33,697][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:09:34,359][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:09:35,023][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:09:36,029][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:09:36,692][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:09:37,352][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:09:38,010][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:09:38,669][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:09:39,329][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:09:39,991][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:09:40,653][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:09:41,312][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:09:41,972][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:09:42,631][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:09:43,290][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:09:43,949][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:09:44,608][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:09:45,267][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:09:45,926][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:09:46,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:09:47,387][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:09:48,758][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:09:48,761][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:09:48,762][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:09:50,239][__main__][INFO] - Iteration 262 took 56s (14.06% Gen, 83.31% Train). Generation: 7s, Training: 46s. Estimated remaining time: 11h 32m 58s. Estimated total time: 15h 36m 43s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 40s, 500 more iterations: 7h 48m 21s. [2026-03-25 18:09:50,241][__main__][INFO] - Starting iteration 262. [2026-03-25 18:09:50,246][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:09:50,246][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:09:55,274][__main__][INFO] - Number of regex retries in iteration 262: 0 [2026-03-25 18:09:55,275][__main__][INFO] - agents played in iteration 262 are Bob, Alice [2026-03-25 18:09:55,744][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:09:55,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:09:55,809][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:09:55,809][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:09:56,510][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:09:57,115][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:09:57,779][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:09:58,440][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:09:59,103][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:09:59,768][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:10:00,430][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:10:01,094][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:10:01,757][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:10:02,420][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:10:03,087][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:10:03,747][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:10:04,410][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:10:05,074][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:10:05,735][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:10:06,397][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:10:07,059][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:10:07,720][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:10:08,382][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:10:09,043][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:10:09,705][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:10:10,365][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:10:11,031][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:10:11,691][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:10:12,354][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:10:13,015][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:10:13,678][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:10:14,340][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:10:15,003][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:10:15,664][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:10:16,327][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:10:16,989][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:10:17,651][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:10:18,312][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:10:18,973][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:10:19,637][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:10:20,299][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:10:20,964][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:10:21,626][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:10:22,289][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:10:22,952][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:10:23,614][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:10:24,277][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:10:24,939][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:10:25,603][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:10:26,267][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:10:26,929][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:10:27,591][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:10:28,590][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:10:29,252][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:10:29,914][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:10:30,577][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:10:31,239][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:10:31,899][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:10:32,558][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:10:33,219][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:10:33,880][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:10:34,541][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:10:35,202][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:10:35,863][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:10:36,524][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:10:37,185][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:10:37,846][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:10:38,509][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:10:39,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:10:39,978][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:10:41,353][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:10:41,356][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:10:41,357][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:10:42,710][__main__][INFO] - Iteration 263 took 52s (9.58% Gen, 87.83% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 29m 49s. Estimated total time: 14h 34m 26s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 26s, 500 more iterations: 7h 17m 13s. [2026-03-25 18:10:42,713][__main__][INFO] - Starting iteration 263. [2026-03-25 18:10:42,717][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:10:42,717][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:10:48,318][__main__][INFO] - Number of regex retries in iteration 263: 0 [2026-03-25 18:10:48,319][__main__][INFO] - agents played in iteration 263 are Bob, Alice [2026-03-25 18:10:48,896][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:10:48,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:10:48,961][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:10:48,962][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:10:49,673][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:10:50,290][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:10:50,961][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:10:51,623][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:10:52,286][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:10:52,953][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:10:53,616][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:10:54,279][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:10:54,943][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:10:55,606][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:10:56,269][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:10:56,934][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:10:57,597][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:10:58,260][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:10:58,923][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:10:59,587][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:11:00,275][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:11:00,940][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:11:01,603][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:11:02,266][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:11:02,928][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:11:03,591][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:11:04,253][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:11:04,917][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:11:05,579][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:11:06,243][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:11:06,906][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:11:07,568][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:11:08,231][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:11:08,894][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:11:09,557][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:11:10,219][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:11:10,883][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:11:11,546][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:11:12,211][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:11:12,875][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:11:13,539][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:11:14,201][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:11:14,864][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:11:15,526][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:11:16,189][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:11:16,851][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:11:17,515][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:11:18,175][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:11:18,835][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:11:19,496][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:11:20,157][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:11:20,816][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:11:21,812][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:11:22,472][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:11:23,131][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:11:23,792][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:11:24,452][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:11:25,110][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:11:25,769][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:11:26,428][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:11:27,087][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:11:27,747][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:11:28,407][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:11:29,068][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:11:29,727][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:11:30,387][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:11:31,047][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:11:31,713][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:11:32,371][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:11:33,232][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:11:34,661][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:11:34,664][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:11:34,665][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:11:36,546][__main__][INFO] - Iteration 264 took 53s (10.41% Gen, 86.09% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 51m 40s. Estimated total time: 14h 57m 11s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 43s, 500 more iterations: 7h 28m 35s. [2026-03-25 18:11:36,548][__main__][INFO] - Starting iteration 264. [2026-03-25 18:11:36,553][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:11:36,553][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:11:45,169][__main__][INFO] - Number of regex retries in iteration 264: 0 [2026-03-25 18:11:45,170][__main__][INFO] - agents played in iteration 264 are Bob, Alice [2026-03-25 18:11:45,651][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:11:45,714][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:11:45,714][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:11:45,715][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:11:46,426][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:11:47,049][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:11:47,710][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:11:48,370][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:11:49,030][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:11:49,689][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:11:50,349][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:11:51,010][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:11:51,669][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:11:52,330][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:11:52,989][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:11:53,648][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:11:54,308][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:11:54,967][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:11:55,626][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:11:57,600][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:11:58,259][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:11:58,921][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:11:59,581][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:12:00,241][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:12:00,900][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:12:01,558][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:12:02,216][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:12:02,874][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:12:03,532][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:12:04,190][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:12:04,849][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:12:05,508][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:12:06,167][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:12:06,826][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:12:07,485][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:12:08,144][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:12:08,803][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:12:09,461][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:12:10,119][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:12:10,779][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:12:11,439][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:12:12,098][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:12:12,758][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:12:13,418][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:12:14,082][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:12:14,744][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:12:15,405][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:12:16,066][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:12:16,725][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:12:17,386][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:12:18,045][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:12:18,707][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:12:19,703][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:12:20,362][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:12:21,021][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:12:21,682][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:12:22,341][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:12:23,000][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:12:23,661][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:12:24,322][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:12:24,981][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:12:25,641][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:12:26,301][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:12:26,961][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:12:27,624][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:12:28,284][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:12:28,945][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:12:29,607][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:12:30,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:12:31,075][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 18:12:32,441][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:12:32,444][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:12:32,445][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:12:33,650][__main__][INFO] - Iteration 265 took 57s (15.09% Gen, 82.79% Train). Generation: 8s, Training: 47s. Estimated remaining time: 11h 45m 11s. Estimated total time: 15h 51m 40s. Time estimates for 10 more iterations: 9m 31s, 100 more iterations: 1h 35m 10s, 500 more iterations: 7h 55m 50s. [2026-03-25 18:12:33,653][__main__][INFO] - Starting iteration 265. [2026-03-25 18:12:33,657][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:12:33,657][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:12:43,819][__main__][INFO] - Number of regex retries in iteration 265: 0 [2026-03-25 18:12:43,820][__main__][INFO] - agents played in iteration 265 are Bob, Alice [2026-03-25 18:12:44,969][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:12:45,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:12:45,033][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:12:45,033][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:12:45,843][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:12:46,461][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:12:47,125][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:12:47,784][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:12:48,447][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:12:49,110][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:12:49,772][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:12:50,433][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:12:51,096][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:12:51,759][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:12:52,421][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:12:53,084][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:12:53,746][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:12:54,407][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:12:55,068][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:12:55,730][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:12:56,394][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:12:57,055][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:12:57,714][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:12:58,632][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:12:59,291][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:13:00,067][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:13:00,726][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:13:01,384][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:13:02,044][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:13:02,703][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:13:03,361][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:13:04,020][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:13:04,679][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:13:05,337][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:13:05,995][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:13:06,654][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:13:07,312][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:13:07,970][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:13:08,628][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:13:09,287][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:13:09,945][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:13:10,605][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:13:11,264][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:13:11,923][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:13:12,584][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:13:13,244][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:13:13,902][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:13:14,561][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:13:15,219][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:13:15,878][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:13:16,536][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:13:17,195][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:13:18,178][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:13:18,837][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:13:19,498][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:13:20,156][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:13:20,815][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:13:21,474][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:13:22,136][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:13:22,796][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:13:23,456][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:13:24,113][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:13:24,771][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:13:25,432][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:13:26,092][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:13:26,749][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:13:27,410][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:13:28,068][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:13:28,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:13:29,443][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:13:30,994][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:13:30,997][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:13:31,000][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:13:32,442][__main__][INFO] - Iteration 266 took 58s (17.29% Gen, 80.26% Train). Generation: 10s, Training: 47s. Estimated remaining time: 12h 12m 20s. Estimated total time: 16h 19m 47s. Time estimates for 10 more iterations: 9m 47s, 100 more iterations: 1h 37m 58s, 500 more iterations: 8h 9m 53s. [2026-03-25 18:13:32,444][__main__][INFO] - Starting iteration 266. [2026-03-25 18:13:32,448][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:13:32,448][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:13:38,659][__main__][INFO] - Number of regex retries in iteration 266: 0 [2026-03-25 18:13:38,660][__main__][INFO] - agents played in iteration 266 are Bob, Alice [2026-03-25 18:13:39,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:13:39,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:13:39,194][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:13:39,194][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:13:39,993][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:13:40,613][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:13:41,275][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:13:41,934][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:13:42,593][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:13:43,253][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:13:43,912][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:13:44,572][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:13:45,232][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:13:45,891][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:13:46,550][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:13:47,210][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:13:47,871][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:13:48,530][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:13:49,191][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:13:49,851][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:13:50,512][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:13:51,172][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:13:51,831][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:13:52,490][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:13:53,149][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:13:53,809][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:13:54,468][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:13:55,128][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:13:55,788][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:13:56,448][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:13:57,107][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:13:57,767][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:13:58,427][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:13:59,089][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:13:59,750][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:14:00,411][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:14:01,069][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:14:01,729][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:14:02,388][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:14:03,047][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:14:03,706][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:14:04,366][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:14:05,026][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:14:05,685][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:14:06,345][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:14:07,005][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:14:07,664][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:14:08,323][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:14:08,982][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:14:09,643][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:14:10,304][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:14:10,965][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:14:11,951][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:14:12,610][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:14:13,269][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:14:13,927][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:14:14,587][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:14:15,245][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:14:15,903][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:14:16,562][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:14:17,221][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:14:17,880][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:14:18,537][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:14:19,196][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:14:19,857][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:14:20,515][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:14:21,175][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:14:21,834][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:14:22,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:14:23,310][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:14:24,717][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:14:24,720][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:14:24,721][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:14:26,115][__main__][INFO] - Iteration 267 took 53s (11.57% Gen, 85.82% Train). Generation: 6s, Training: 46s. Estimated remaining time: 10h 46m 8s. Estimated total time: 14h 54m 29s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 26s, 500 more iterations: 7h 27m 14s. [2026-03-25 18:14:26,118][__main__][INFO] - Starting iteration 267. [2026-03-25 18:14:26,122][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:14:26,122][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:14:31,027][__main__][INFO] - Number of regex retries in iteration 267: 0 [2026-03-25 18:14:31,028][__main__][INFO] - agents played in iteration 267 are Bob, Alice [2026-03-25 18:14:31,626][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:14:31,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:14:31,688][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:14:31,689][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:14:32,360][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:14:32,971][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:14:33,632][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:14:34,292][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:14:34,952][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:14:35,610][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:14:36,270][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:14:36,929][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:14:37,590][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:14:38,249][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:14:38,909][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:14:39,569][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:14:40,231][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:14:40,892][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:14:41,555][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:14:42,212][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:14:42,872][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:14:43,532][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:14:44,194][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:14:44,855][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:14:45,513][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:14:46,176][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:14:46,835][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:14:47,494][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:14:48,153][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:14:48,813][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:14:49,473][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:14:50,132][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:14:50,791][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:14:51,450][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:14:52,110][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:14:52,771][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:14:53,430][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:14:54,089][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:14:54,749][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:14:55,410][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:14:56,069][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:14:56,728][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:14:57,387][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:14:58,046][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:14:58,706][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:14:59,366][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:15:00,025][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:15:00,684][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:15:01,343][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:15:02,002][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:15:02,663][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:15:03,327][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:15:04,325][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:15:04,986][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:15:05,645][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:15:06,305][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:15:06,965][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:15:07,624][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:15:08,284][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:15:08,945][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:15:09,607][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:15:10,267][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:15:10,929][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:15:11,587][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:15:12,248][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:15:12,907][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:15:13,568][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:15:14,227][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:15:14,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:15:15,742][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:15:17,152][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:15:17,155][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:15:17,156][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:15:18,610][__main__][INFO] - Iteration 268 took 52s (9.35% Gen, 87.88% Train). Generation: 4s, Training: 46s. Estimated remaining time: 10h 25m 37s. Estimated total time: 14h 34m 50s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 29s, 500 more iterations: 7h 17m 25s. [2026-03-25 18:15:18,613][__main__][INFO] - Starting iteration 268. [2026-03-25 18:15:18,617][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:15:18,618][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:15:23,952][__main__][INFO] - Number of regex retries in iteration 268: 0 [2026-03-25 18:15:23,953][__main__][INFO] - agents played in iteration 268 are Bob, Alice [2026-03-25 18:15:24,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:15:24,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:15:24,510][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:15:24,511][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:15:25,230][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:15:25,866][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:15:26,527][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:15:27,191][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:15:27,850][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:15:28,511][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:15:29,175][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:15:29,833][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:15:30,495][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:15:31,157][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:15:31,816][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:15:32,477][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:15:33,135][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:15:33,795][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:15:34,454][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:15:35,114][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:15:35,773][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:15:36,433][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:15:37,092][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:15:37,751][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:15:38,410][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:15:39,069][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:15:39,728][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:15:40,387][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:15:41,047][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:15:41,706][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:15:42,365][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:15:43,024][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:15:43,684][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:15:44,343][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:15:45,002][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:15:45,663][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:15:46,324][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:15:46,985][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:15:47,646][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:15:48,309][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:15:48,972][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:15:49,634][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:15:50,297][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:15:50,961][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:15:51,622][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:15:52,285][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:15:52,947][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:15:53,610][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:15:54,272][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:15:54,933][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:15:55,595][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:15:56,256][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:15:57,255][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:15:57,916][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:15:58,578][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:15:59,241][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:15:59,902][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:16:00,563][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:16:01,226][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:16:01,888][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:16:02,549][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:16:03,211][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:16:03,873][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:16:04,536][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:16:05,198][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:16:05,860][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:16:06,520][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:16:07,183][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:16:07,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:16:08,623][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:16:10,026][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:16:10,029][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:16:10,030][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:16:12,363][__main__][INFO] - Iteration 269 took 53s (9.93% Gen, 85.73% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 45m 40s. Estimated total time: 14h 55m 47s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 34s, 500 more iterations: 7h 27m 53s. [2026-03-25 18:16:12,365][__main__][INFO] - Starting iteration 269. [2026-03-25 18:16:12,369][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:16:12,370][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:16:18,299][__main__][INFO] - Number of regex retries in iteration 269: 0 [2026-03-25 18:16:18,300][__main__][INFO] - agents played in iteration 269 are Bob, Alice [2026-03-25 18:16:19,382][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:16:19,444][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:16:19,445][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:16:19,446][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:16:20,294][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:16:20,920][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:16:21,582][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:16:22,243][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:16:22,905][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:16:23,565][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:16:24,227][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:16:24,889][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:16:25,552][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:16:26,219][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:16:26,882][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:16:27,545][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:16:28,208][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:16:28,870][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:16:29,532][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:16:30,194][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:16:30,855][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:16:31,521][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:16:32,181][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:16:32,842][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:16:33,503][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:16:34,164][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:16:34,825][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:16:35,486][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:16:36,146][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:16:36,805][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:16:37,465][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:16:38,125][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:16:38,784][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:16:39,444][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:16:40,105][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:16:40,765][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:16:41,425][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:16:42,084][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:16:42,743][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:16:43,405][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:16:44,067][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:16:44,726][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:16:45,385][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:16:46,045][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:16:46,705][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:16:47,365][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:16:48,024][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:16:48,684][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:16:49,344][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:16:50,003][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:16:50,663][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:16:51,323][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:16:52,312][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:16:52,972][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:16:53,631][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:16:54,291][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:16:54,951][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:16:55,611][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:16:56,269][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:16:56,927][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:16:57,585][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:16:58,244][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:16:58,905][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:16:59,564][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:17:00,222][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:17:00,883][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:17:01,542][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:17:02,201][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:17:02,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:17:03,642][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:17:05,060][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:17:05,063][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:17:05,065][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:17:06,526][__main__][INFO] - Iteration 270 took 54s (10.95% Gen, 86.35% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 51m 37s. Estimated total time: 15h 2m 38s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 15s, 500 more iterations: 7h 31m 19s. [2026-03-25 18:17:06,528][__main__][INFO] - Starting iteration 270. [2026-03-25 18:17:06,532][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:17:06,533][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:17:12,729][__main__][INFO] - Number of regex retries in iteration 270: 0 [2026-03-25 18:17:12,730][__main__][INFO] - agents played in iteration 270 are Bob, Alice [2026-03-25 18:17:13,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:17:13,621][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:17:13,622][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:17:13,622][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:17:14,285][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:17:14,901][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:17:15,564][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:17:16,225][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:17:16,885][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:17:17,545][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:17:18,205][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:17:18,864][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:17:19,523][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:17:20,182][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:17:20,841][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:17:21,501][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:17:22,161][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:17:22,821][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:17:23,480][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:17:24,139][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:17:24,799][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:17:25,459][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:17:26,119][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:17:26,778][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:17:27,437][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:17:28,097][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:17:28,757][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:17:29,417][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:17:30,075][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:17:30,736][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:17:31,396][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:17:32,056][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:17:32,716][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:17:33,376][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:17:34,036][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:17:34,696][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:17:35,357][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:17:36,016][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:17:36,678][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:17:37,338][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:17:37,996][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:17:38,655][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:17:39,315][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:17:39,974][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:17:40,633][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:17:41,293][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:17:41,953][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:17:42,611][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:17:43,271][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:17:43,933][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:17:44,596][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:17:45,259][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:17:46,266][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:17:46,927][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:17:47,588][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:17:48,253][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:17:48,914][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:17:49,577][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:17:50,241][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:17:50,900][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:17:51,559][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:17:52,218][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:17:52,880][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:17:53,539][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:17:54,198][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:17:54,860][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:17:55,521][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:17:56,184][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:17:56,845][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:17:57,609][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:17:59,005][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:17:59,008][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:17:59,010][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:18:01,013][__main__][INFO] - Iteration 271 took 54s (11.37% Gen, 84.94% Train). Generation: 6s, Training: 46s. Estimated remaining time: 10h 56m 7s. Estimated total time: 15h 8m 3s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 48s, 500 more iterations: 7h 34m 1s. [2026-03-25 18:18:01,016][__main__][INFO] - Starting iteration 271. [2026-03-25 18:18:01,020][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:18:01,020][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:18:06,143][__main__][INFO] - Number of regex retries in iteration 271: 0 [2026-03-25 18:18:06,144][__main__][INFO] - agents played in iteration 271 are Bob, Alice [2026-03-25 18:18:06,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:18:06,777][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:18:06,778][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:18:06,778][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:18:07,526][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:18:08,148][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:18:08,810][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:18:09,471][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:18:10,133][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:18:10,794][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:18:11,453][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:18:12,114][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:18:12,774][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:18:13,435][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:18:14,096][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:18:14,756][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:18:15,417][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:18:16,076][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:18:16,735][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:18:17,399][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:18:18,060][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:18:18,720][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:18:19,383][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:18:20,044][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:18:20,706][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:18:21,367][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:18:22,029][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:18:22,691][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:18:23,352][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:18:24,014][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:18:24,674][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:18:25,335][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:18:25,996][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:18:26,657][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:18:27,320][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:18:27,982][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:18:28,642][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:18:29,303][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:18:29,963][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:18:30,623][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:18:31,283][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:18:31,947][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:18:32,609][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:18:33,271][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:18:33,933][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:18:34,596][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:18:35,256][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:18:35,915][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:18:36,574][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:18:37,233][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:18:37,892][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:18:38,551][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:18:39,543][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:18:40,204][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:18:40,864][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:18:41,522][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:18:42,181][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:18:42,840][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:18:43,501][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:18:44,160][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:18:44,821][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:18:45,480][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:18:46,139][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:18:46,797][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:18:47,457][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:18:48,115][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:18:48,774][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:18:49,434][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:18:50,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:18:50,884][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:18:52,336][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:18:52,339][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:18:52,341][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:18:53,783][__main__][INFO] - Iteration 272 took 52s (9.76% Gen, 87.50% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 26m 36s. Estimated total time: 14h 39m 24s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 56s, 500 more iterations: 7h 19m 42s. [2026-03-25 18:18:53,785][__main__][INFO] - Starting iteration 272. [2026-03-25 18:18:53,789][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:18:53,790][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:18:58,743][__main__][INFO] - Number of regex retries in iteration 272: 0 [2026-03-25 18:18:58,744][__main__][INFO] - agents played in iteration 272 are Bob, Alice [2026-03-25 18:18:59,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:18:59,326][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:18:59,327][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:18:59,327][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:19:00,045][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:19:00,669][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:19:01,330][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:19:01,989][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:19:02,648][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:19:03,309][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:19:03,971][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:19:04,632][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:19:05,292][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:19:05,952][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:19:06,613][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:19:07,274][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:19:07,936][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:19:08,596][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:19:09,257][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:19:09,918][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:19:10,579][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:19:11,238][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:19:11,898][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:19:12,558][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:19:13,218][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:19:13,878][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:19:14,538][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:19:15,198][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:19:15,859][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:19:16,519][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:19:17,180][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:19:17,839][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:19:18,498][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:19:19,158][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:19:19,818][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:19:20,478][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:19:21,138][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:19:21,798][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:19:22,461][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:19:23,122][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:19:23,783][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:19:24,457][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:19:25,120][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:19:25,784][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:19:26,445][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:19:27,108][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:19:27,771][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:19:28,434][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:19:29,098][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:19:29,761][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:19:30,425][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:19:31,087][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:19:32,214][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:19:32,877][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:19:33,537][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:19:34,199][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:19:34,862][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:19:35,524][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:19:36,187][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:19:36,849][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:19:37,511][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:19:38,175][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:19:38,836][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:19:39,498][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:19:40,159][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:19:40,821][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:19:41,483][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:19:42,144][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:19:42,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:19:43,655][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:19:45,058][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:19:45,060][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:19:45,062][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:19:46,498][__main__][INFO] - Iteration 273 took 52s (9.40% Gen, 87.87% Train). Generation: 4s, Training: 46s. Estimated remaining time: 10h 24m 49s. Estimated total time: 14h 38m 30s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 51s, 500 more iterations: 7h 19m 15s. [2026-03-25 18:19:46,501][__main__][INFO] - Starting iteration 273. [2026-03-25 18:19:46,508][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:19:46,508][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:19:51,635][__main__][INFO] - Number of regex retries in iteration 273: 0 [2026-03-25 18:19:51,636][__main__][INFO] - agents played in iteration 273 are Bob, Alice [2026-03-25 18:19:52,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:19:52,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:19:52,193][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:19:52,194][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:19:53,118][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:19:53,737][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:19:54,404][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:19:55,063][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:19:55,724][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:19:56,386][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:19:57,047][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:19:57,713][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:19:58,378][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:19:59,042][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:19:59,706][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:20:00,373][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:20:01,035][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:20:01,700][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:20:02,362][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:20:03,024][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:20:03,687][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:20:04,445][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:20:05,132][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:20:05,795][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:20:06,457][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:20:07,119][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:20:07,780][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:20:08,441][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:20:09,103][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:20:09,765][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:20:10,427][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:20:11,088][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:20:11,750][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:20:12,411][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:20:13,073][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:20:13,734][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:20:14,396][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:20:15,059][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:20:15,720][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:20:16,383][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:20:17,044][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:20:17,705][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:20:18,370][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:20:19,033][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:20:19,695][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:20:20,356][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:20:21,019][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:20:21,683][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:20:22,345][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:20:23,008][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:20:23,669][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:20:24,330][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:20:25,355][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:20:26,017][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:20:26,677][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:20:27,338][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:20:27,997][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:20:28,659][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:20:29,318][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:20:29,977][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:20:30,638][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:20:31,298][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:20:31,957][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:20:32,616][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:20:33,277][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:20:33,935][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:20:34,594][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:20:35,254][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:20:35,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:20:36,811][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:20:38,192][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:20:38,194][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:20:38,196][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:20:39,661][__main__][INFO] - Iteration 274 took 53s (9.65% Gen, 87.59% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 31m 20s. Estimated total time: 14h 45m 55s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 35s, 500 more iterations: 7h 22m 57s. [2026-03-25 18:20:39,664][__main__][INFO] - Starting iteration 274. [2026-03-25 18:20:39,676][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:20:39,677][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:20:45,652][__main__][INFO] - Number of regex retries in iteration 274: 0 [2026-03-25 18:20:45,654][__main__][INFO] - agents played in iteration 274 are Bob, Alice [2026-03-25 18:20:46,526][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:20:46,592][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:20:46,593][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:20:46,593][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:20:47,451][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:20:48,062][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:20:48,721][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:20:49,381][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:20:50,041][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:20:50,701][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:20:51,362][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:20:52,023][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:20:52,682][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:20:53,341][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:20:54,000][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:20:54,660][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:20:55,318][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:20:55,977][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:20:56,636][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:20:57,295][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:20:57,954][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:20:58,614][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:20:59,273][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:20:59,932][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:21:00,592][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:21:01,253][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:21:01,914][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:21:02,576][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:21:03,238][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:21:03,900][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:21:04,560][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:21:05,222][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:21:05,884][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:21:06,547][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:21:07,210][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:21:07,872][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:21:08,536][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:21:09,199][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:21:09,863][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:21:10,525][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:21:11,188][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:21:11,854][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:21:12,518][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:21:13,180][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:21:13,843][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:21:14,505][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:21:15,165][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:21:15,827][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:21:16,489][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:21:17,151][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:21:17,814][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:21:18,477][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:21:19,469][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:21:20,136][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:21:20,798][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:21:21,464][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:21:22,129][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:21:22,791][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:21:23,456][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:21:24,118][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:21:24,778][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:21:25,440][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:21:26,102][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:21:26,764][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:21:27,424][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:21:28,085][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:21:28,746][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:21:29,407][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:21:30,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:21:30,877][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:21:32,285][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:21:32,288][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:21:32,289][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:21:34,013][__main__][INFO] - Iteration 275 took 54s (11.00% Gen, 85.82% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 50m 9s. Estimated total time: 15h 5m 38s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 33s, 500 more iterations: 7h 32m 49s. [2026-03-25 18:21:34,015][__main__][INFO] - Starting iteration 275. [2026-03-25 18:21:34,019][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:21:34,020][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:21:40,082][__main__][INFO] - Number of regex retries in iteration 275: 0 [2026-03-25 18:21:40,083][__main__][INFO] - agents played in iteration 275 are Bob, Alice [2026-03-25 18:21:40,660][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:21:40,725][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:21:40,726][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:21:40,726][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:21:41,419][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:21:42,034][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:21:42,698][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:21:43,358][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:21:44,018][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:21:44,679][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:21:45,338][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:21:45,999][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:21:46,658][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:21:47,317][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:21:47,976][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:21:48,636][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:21:49,295][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:21:49,954][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:21:50,614][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:21:51,273][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:21:51,931][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:21:52,590][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:21:53,250][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:21:53,910][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:21:54,569][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:21:55,228][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:21:55,887][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:21:56,547][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:21:57,207][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:21:57,866][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:21:58,526][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:21:59,185][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:21:59,845][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:22:00,504][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:22:01,163][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:22:01,822][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:22:02,481][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:22:03,143][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:22:03,805][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:22:04,464][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:22:05,124][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:22:05,784][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:22:06,443][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:22:07,101][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:22:07,760][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:22:08,420][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:22:09,079][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:22:09,738][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:22:10,396][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:22:11,055][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:22:11,715][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:22:12,373][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:22:13,362][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:22:14,021][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:22:14,680][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:22:15,339][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:22:16,002][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:22:16,665][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:22:17,324][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:22:17,983][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:22:18,643][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:22:19,302][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:22:19,961][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:22:20,619][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:22:21,278][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:22:21,937][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:22:22,595][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:22:23,255][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:22:23,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:22:24,793][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:22:25,920][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:22:25,923][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:22:25,924][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:22:27,183][__main__][INFO] - Iteration 276 took 53s (11.40% Gen, 86.22% Train). Generation: 6s, Training: 45s. Estimated remaining time: 10h 29m 43s. Estimated total time: 14h 46m 5s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 36s, 500 more iterations: 7h 23m 2s. [2026-03-25 18:22:27,185][__main__][INFO] - Starting iteration 276. [2026-03-25 18:22:27,190][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:22:27,191][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:22:32,317][__main__][INFO] - Number of regex retries in iteration 276: 0 [2026-03-25 18:22:32,318][__main__][INFO] - agents played in iteration 276 are Bob, Alice [2026-03-25 18:22:32,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:22:32,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:22:32,980][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:22:32,981][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:22:33,645][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:22:34,260][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:22:34,921][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:22:35,580][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:22:36,239][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:22:36,898][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:22:37,557][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:22:38,216][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:22:38,875][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:22:39,534][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:22:40,194][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:22:40,853][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:22:41,511][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:22:42,171][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:22:42,831][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:22:43,490][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:22:44,149][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:22:44,808][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:22:45,470][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:22:46,129][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:22:46,789][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:22:47,447][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:22:48,106][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:22:48,767][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:22:49,428][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:22:50,087][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:22:50,747][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:22:51,406][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:22:52,066][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:22:52,728][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:22:53,387][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:22:54,046][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:22:54,706][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:22:55,364][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:22:56,026][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:22:56,687][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:22:57,349][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:22:58,008][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:22:58,670][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:22:59,329][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:22:59,989][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:23:00,649][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:23:01,308][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:23:01,968][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:23:02,627][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:23:03,286][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:23:03,946][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:23:04,606][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:23:05,595][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:23:06,255][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:23:06,914][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:23:07,573][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:23:08,231][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:23:08,890][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:23:09,550][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:23:10,211][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:23:10,870][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:23:11,528][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:23:12,187][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:23:12,845][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:23:13,504][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:23:14,164][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:23:14,823][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:23:15,482][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:23:16,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:23:16,916][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:23:18,313][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:23:18,315][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:23:18,316][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:23:19,741][__main__][INFO] - Iteration 277 took 52s (9.75% Gen, 87.53% Train). Generation: 5s, Training: 45s. Estimated remaining time: 10h 18m 39s. Estimated total time: 14h 35m 53s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 35s, 500 more iterations: 7h 17m 56s. [2026-03-25 18:23:19,743][__main__][INFO] - Starting iteration 277. [2026-03-25 18:23:19,748][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:23:19,749][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:23:30,197][__main__][INFO] - Number of regex retries in iteration 277: 0 [2026-03-25 18:23:30,198][__main__][INFO] - agents played in iteration 277 are Bob, Alice [2026-03-25 18:23:30,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:23:30,875][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:23:30,876][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:23:30,877][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:23:31,708][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:23:32,320][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:23:32,980][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:23:33,638][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:23:34,298][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:23:34,955][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:23:35,613][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:23:36,272][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:23:36,931][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:23:37,589][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:23:38,247][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:23:38,905][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:23:39,564][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:23:40,225][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:23:40,883][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:23:41,541][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:23:42,199][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:23:42,857][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:23:43,517][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:23:44,175][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:23:44,834][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:23:45,493][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:23:46,151][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:23:46,809][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:23:47,468][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:23:48,127][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:23:48,785][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:23:49,443][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:23:50,101][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:23:50,760][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:23:51,418][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:23:52,076][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:23:52,735][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:23:53,393][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:23:54,052][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:23:54,734][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:23:57,180][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:23:57,838][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:23:58,496][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:23:59,153][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:23:59,811][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:24:00,701][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:24:01,359][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:24:02,016][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:24:02,674][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:24:03,332][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:24:03,990][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:24:04,649][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:24:05,653][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:24:06,313][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:24:06,971][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:24:07,629][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:24:08,288][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:24:08,948][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:24:09,606][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:24:10,265][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:24:10,924][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:24:11,583][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:24:12,241][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:24:12,899][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:24:13,557][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:24:14,215][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:24:14,875][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:24:15,533][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:24:16,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:24:16,991][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:45 [2026-03-25 18:24:18,515][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:24:18,518][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:24:18,519][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:24:20,039][__main__][INFO] - Iteration 278 took 1m 0s (17.33% Gen, 80.14% Train). Generation: 10s, Training: 48s. Estimated remaining time: 12h 26m 38s. Estimated total time: 16h 44m 53s. Time estimates for 10 more iterations: 10m 2s, 100 more iterations: 1h 40m 29s, 500 more iterations: 8h 22m 26s. [2026-03-25 18:24:20,041][__main__][INFO] - Starting iteration 278. [2026-03-25 18:24:20,065][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:24:20,066][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:24:29,565][__main__][INFO] - Number of regex retries in iteration 278: 0 [2026-03-25 18:24:29,566][__main__][INFO] - agents played in iteration 278 are Bob, Alice [2026-03-25 18:24:30,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:24:30,509][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:24:30,510][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:24:30,510][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:24:31,333][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:24:31,946][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:24:32,610][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:24:33,271][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:24:33,932][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:24:34,593][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:24:35,254][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:24:35,914][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:24:36,573][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:24:37,234][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:24:37,894][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:24:38,553][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:24:39,211][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:24:39,871][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:24:40,531][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:24:41,190][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:24:41,850][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:24:42,508][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:24:43,168][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:24:43,827][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:24:44,486][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:24:45,145][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:24:45,804][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:24:46,463][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:24:47,122][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:24:47,781][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:24:48,440][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:24:49,100][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:24:49,758][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:24:50,418][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:24:51,077][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:24:51,736][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:24:52,395][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:24:53,055][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:24:53,716][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:24:54,375][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:24:55,034][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:24:55,693][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:24:56,352][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:24:57,011][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:24:57,670][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:24:58,329][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:24:58,989][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:24:59,647][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:25:00,307][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:25:00,967][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:25:01,626][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:25:02,286][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:25:03,297][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:25:03,956][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:25:04,614][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:25:05,275][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:25:05,935][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:25:06,592][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:25:07,250][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:25:07,910][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:25:08,569][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:25:09,227][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:25:09,886][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:25:10,544][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:25:11,202][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:25:11,860][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:25:12,519][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:25:13,178][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:25:13,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:25:14,571][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:25:16,022][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:25:16,025][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:25:16,034][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:25:17,491][__main__][INFO] - Iteration 279 took 57s (16.54% Gen, 80.92% Train). Generation: 9s, Training: 46s. Estimated remaining time: 11h 37m 55s. Estimated total time: 15h 57m 7s. Time estimates for 10 more iterations: 9m 34s, 100 more iterations: 1h 35m 42s, 500 more iterations: 7h 58m 33s. [2026-03-25 18:25:17,493][__main__][INFO] - Starting iteration 279. [2026-03-25 18:25:17,498][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:25:17,499][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:25:24,233][__main__][INFO] - Number of regex retries in iteration 279: 0 [2026-03-25 18:25:24,234][__main__][INFO] - agents played in iteration 279 are Bob, Alice [2026-03-25 18:25:24,819][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:25:24,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:25:24,881][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:25:24,882][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:25:25,691][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:25:26,310][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:25:26,969][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:25:27,628][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:25:28,288][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:25:28,948][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:25:29,611][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:25:30,269][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:25:30,929][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:25:31,588][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:25:32,248][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:25:32,905][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:25:33,564][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:25:34,223][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:25:34,883][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:25:35,541][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:25:36,200][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:25:36,859][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:25:37,519][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:25:38,184][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:25:38,843][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:25:39,501][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:25:40,161][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:25:40,821][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:25:41,479][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:25:42,138][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:25:42,799][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:25:43,458][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:25:44,117][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:25:44,776][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:25:45,434][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:25:46,094][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:25:46,755][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:25:47,412][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:25:48,072][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:25:48,732][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:25:49,391][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:25:50,050][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:25:50,708][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:25:51,367][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:25:52,026][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:25:52,685][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:25:53,344][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:25:54,004][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:25:54,664][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:25:55,324][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:25:55,985][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:25:56,644][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:25:57,645][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:25:58,304][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:25:58,963][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:25:59,624][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:26:00,284][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:26:00,942][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:26:01,602][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:26:02,261][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:26:02,920][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:26:03,579][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:26:04,238][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:26:04,897][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:26:05,556][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:26:06,214][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:26:06,873][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:26:07,532][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:26:08,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:26:09,012][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:26:10,415][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:26:10,418][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:26:10,419][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:26:11,809][__main__][INFO] - Iteration 280 took 54s (12.40% Gen, 85.03% Train). Generation: 6s, Training: 46s. Estimated remaining time: 10h 45m 6s. Estimated total time: 15h 5m 13s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 31s, 500 more iterations: 7h 32m 36s. [2026-03-25 18:26:11,811][__main__][INFO] - Starting iteration 280. [2026-03-25 18:26:11,815][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:26:11,816][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:26:17,358][__main__][INFO] - Number of regex retries in iteration 280: 0 [2026-03-25 18:26:17,359][__main__][INFO] - agents played in iteration 280 are Bob, Alice [2026-03-25 18:26:17,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:26:17,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:26:17,935][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:26:17,936][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:26:18,687][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:26:19,303][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:26:19,966][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:26:20,624][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:26:21,283][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:26:21,944][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:26:22,604][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:26:23,263][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:26:23,924][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:26:24,582][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:26:25,242][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:26:25,902][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:26:26,562][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:26:27,222][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:26:27,881][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:26:28,541][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:26:29,202][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:26:29,861][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:26:30,520][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:26:31,180][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:26:31,841][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:26:32,500][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:26:33,159][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:26:33,819][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:26:34,478][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:26:35,138][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:26:35,798][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:26:36,457][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:26:37,117][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:26:37,776][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:26:38,437][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:26:39,096][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:26:39,756][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:26:40,416][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:26:41,076][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:26:41,736][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:26:42,395][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:26:43,055][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:26:43,715][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:26:44,374][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:26:45,034][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:26:45,694][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:26:46,353][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:26:47,015][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:26:47,675][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:26:48,335][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:26:48,994][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:26:49,653][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:26:50,641][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:26:51,302][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:26:51,960][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:26:52,621][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:26:53,279][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:26:53,938][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:26:54,603][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:26:55,264][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:26:55,921][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:26:56,580][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:26:57,239][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:26:57,899][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:26:58,558][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:26:59,217][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:26:59,877][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:27:00,536][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:27:01,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:27:02,017][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:27:03,385][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:27:03,387][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:27:03,389][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:27:04,817][__main__][INFO] - Iteration 281 took 53s (10.46% Gen, 86.84% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 22m 24s. Estimated total time: 14h 43m 24s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 20s, 500 more iterations: 7h 21m 42s. [2026-03-25 18:27:04,819][__main__][INFO] - Starting iteration 281. [2026-03-25 18:27:04,823][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:27:04,824][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:27:09,825][__main__][INFO] - Number of regex retries in iteration 281: 0 [2026-03-25 18:27:09,826][__main__][INFO] - agents played in iteration 281 are Bob, Alice [2026-03-25 18:27:10,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:27:10,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:27:10,471][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:27:10,471][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:27:11,229][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:27:11,872][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:27:12,532][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:27:13,191][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:27:13,850][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:27:14,508][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:27:15,167][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:27:15,826][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:27:16,486][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:27:17,146][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:27:17,806][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:27:18,467][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:27:19,127][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:27:19,788][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:27:20,448][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:27:21,108][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:27:21,768][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:27:22,432][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:27:23,091][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:27:23,751][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:27:24,413][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:27:25,073][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:27:25,735][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:27:26,396][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:27:27,056][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:27:27,717][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:27:28,378][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:27:29,038][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:27:29,696][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:27:30,357][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:27:31,017][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:27:31,677][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:27:32,342][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:27:33,002][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:27:33,662][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:27:34,322][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:27:34,984][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:27:35,644][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:27:36,304][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:27:36,965][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:27:37,625][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:27:38,285][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:27:38,946][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:27:39,605][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:27:40,265][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:27:40,925][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:27:41,585][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:27:42,246][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:27:43,233][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:27:43,891][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:27:44,551][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:27:45,209][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:27:45,867][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:27:46,525][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:27:47,184][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:27:47,842][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:27:48,500][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:27:49,158][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:27:49,818][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:27:50,477][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:27:51,136][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:27:51,796][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:27:52,455][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:27:53,114][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:27:53,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:27:54,543][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:27:55,995][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:27:55,998][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:27:55,999][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:27:57,456][__main__][INFO] - Iteration 282 took 52s (9.50% Gen, 87.72% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 15m 22s. Estimated total time: 14h 37m 14s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 43s, 500 more iterations: 7h 18m 37s. [2026-03-25 18:27:57,460][__main__][INFO] - Starting iteration 282. [2026-03-25 18:27:57,464][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:27:57,465][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:28:02,902][__main__][INFO] - Number of regex retries in iteration 282: 0 [2026-03-25 18:28:02,903][__main__][INFO] - agents played in iteration 282 are Bob, Alice [2026-03-25 18:28:03,747][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:28:03,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:28:03,808][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:28:03,809][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:28:04,626][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:28:05,235][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:28:05,897][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:28:06,556][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:28:07,216][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:28:07,877][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:28:08,536][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:28:09,197][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:28:09,857][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:28:10,520][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:28:11,179][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:28:11,839][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:28:12,501][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:28:13,162][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:28:13,824][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:28:14,485][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:28:15,145][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:28:15,805][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:28:16,468][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:28:17,127][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:28:17,786][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:28:18,447][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:28:19,106][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:28:19,765][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:28:20,425][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:28:21,084][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:28:21,743][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:28:22,403][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:28:23,062][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:28:23,721][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:28:24,379][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:28:25,038][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:28:25,698][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:28:26,356][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:28:27,016][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:28:27,675][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:28:28,334][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:28:28,996][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:28:29,656][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:28:30,315][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:28:30,974][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:28:31,632][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:28:32,291][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:28:32,950][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:28:33,609][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:28:34,269][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:28:34,929][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:28:35,588][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:28:36,588][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:28:37,248][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:28:37,907][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:28:38,565][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:28:39,225][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:28:39,885][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:28:40,544][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:28:41,203][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:28:41,861][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:28:42,519][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:28:43,177][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:28:43,836][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:28:44,496][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:28:45,154][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:28:45,813][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:28:46,472][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:28:47,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:28:48,008][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:28:49,387][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:28:49,389][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:28:49,390][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:28:50,798][__main__][INFO] - Iteration 283 took 53s (10.20% Gen, 87.16% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 26m 9s. Estimated total time: 14h 48m 54s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 53s, 500 more iterations: 7h 24m 27s. [2026-03-25 18:28:50,800][__main__][INFO] - Starting iteration 283. [2026-03-25 18:28:50,804][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:28:50,804][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:28:56,767][__main__][INFO] - Number of regex retries in iteration 283: 0 [2026-03-25 18:28:56,768][__main__][INFO] - agents played in iteration 283 are Bob, Alice [2026-03-25 18:28:57,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:28:57,887][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:28:57,887][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:28:57,888][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:28:58,713][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:28:59,342][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:29:00,002][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:29:00,660][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:29:01,319][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:29:01,978][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:29:02,638][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:29:03,297][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:29:03,956][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:29:04,615][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:29:05,274][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:29:05,933][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:29:06,591][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:29:07,250][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:29:07,907][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:29:08,565][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:29:09,223][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:29:09,880][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:29:10,539][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:29:11,199][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:29:11,858][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:29:12,516][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:29:13,176][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:29:13,835][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:29:14,492][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:29:15,151][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:29:15,809][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:29:16,466][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:29:17,125][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:29:17,784][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:29:18,442][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:29:19,100][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:29:19,758][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:29:20,417][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:29:21,076][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:29:21,734][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:29:22,392][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:29:23,050][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:29:23,708][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:29:24,366][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:29:25,024][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:29:25,682][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:29:26,340][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:29:26,998][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:29:27,656][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:29:28,315][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:29:28,973][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:29:29,633][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:29:30,621][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:29:31,281][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:29:31,941][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:29:32,601][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:29:33,259][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:29:33,918][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:29:34,576][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:29:35,234][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:29:35,893][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:29:36,551][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:29:37,209][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:29:37,867][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:29:38,525][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:29:39,184][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:29:40,975][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:29:41,633][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:29:42,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:29:43,035][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 18:29:44,750][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:29:44,753][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:29:44,754][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:29:46,184][__main__][INFO] - Iteration 284 took 55s (10.77% Gen, 86.65% Train). Generation: 5s, Training: 47s. Estimated remaining time: 10h 59m 21s. Estimated total time: 15h 23m 2s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 18s, 500 more iterations: 7h 41m 31s. [2026-03-25 18:29:46,187][__main__][INFO] - Starting iteration 284. [2026-03-25 18:29:46,190][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:29:46,191][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:29:51,692][__main__][INFO] - Number of regex retries in iteration 284: 0 [2026-03-25 18:29:51,693][__main__][INFO] - agents played in iteration 284 are Bob, Alice [2026-03-25 18:29:52,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:29:52,305][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:29:52,305][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:29:52,306][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:29:53,153][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:29:53,779][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:29:54,437][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:29:55,096][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:29:55,756][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:29:56,415][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:29:57,075][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:29:57,735][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:29:58,395][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:29:59,054][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:29:59,714][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:30:00,373][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:30:01,333][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:30:01,990][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:30:02,647][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:30:03,309][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:30:03,968][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:30:04,627][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:30:05,285][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:30:05,944][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:30:06,603][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:30:07,261][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:30:07,919][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:30:08,578][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:30:09,237][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:30:09,895][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:30:10,555][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:30:11,213][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:30:11,871][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:30:12,528][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:30:13,187][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:30:13,845][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:30:14,504][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:30:15,162][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:30:15,821][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:30:16,480][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:30:17,139][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:30:17,796][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:30:18,455][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:30:19,113][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:30:19,771][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:30:20,430][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:30:21,088][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:30:21,747][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:30:22,405][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:30:23,064][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:30:23,723][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:30:24,381][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:30:25,373][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:30:26,032][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:30:26,691][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:30:27,349][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:30:28,007][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:30:28,666][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:30:29,325][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:30:29,985][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:30:30,645][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:30:31,302][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:30:31,962][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:30:32,625][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:30:33,286][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:30:33,945][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:30:34,604][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:30:35,263][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:30:35,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:30:36,898][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:30:38,327][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:30:38,330][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:30:38,331][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:30:39,886][__main__][INFO] - Iteration 285 took 53s (10.25% Gen, 86.85% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 30m 22s. Estimated total time: 14h 54m 57s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 29s, 500 more iterations: 7h 27m 28s. [2026-03-25 18:30:39,889][__main__][INFO] - Starting iteration 285. [2026-03-25 18:30:39,893][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:30:39,893][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:30:49,606][__main__][INFO] - Number of regex retries in iteration 285: 0 [2026-03-25 18:30:49,608][__main__][INFO] - agents played in iteration 285 are Bob, Alice [2026-03-25 18:30:50,132][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:30:50,195][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:30:50,196][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:30:50,196][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:30:50,875][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:30:51,483][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:30:52,144][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:30:52,803][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:30:53,464][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:30:54,124][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:30:54,782][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:30:55,443][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:30:56,103][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:30:56,763][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:30:57,422][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:30:58,081][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:30:58,741][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:30:59,401][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:31:00,060][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:31:00,720][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:31:01,378][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:31:02,037][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:31:02,695][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:31:03,358][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:31:04,019][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:31:04,678][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:31:05,338][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:31:05,999][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:31:06,659][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:31:07,318][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:31:07,977][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:31:08,639][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:31:09,298][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:31:09,958][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:31:10,618][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:31:11,276][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:31:11,937][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:31:12,596][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:31:13,255][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:31:13,914][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:31:14,573][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:31:15,233][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:31:15,892][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:31:16,551][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:31:17,211][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:31:17,870][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:31:18,530][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:31:19,189][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:31:19,849][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:31:20,509][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:31:21,167][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:31:21,827][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:31:22,817][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:31:23,476][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:31:24,135][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:31:24,793][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:31:25,451][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:31:26,110][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:31:26,768][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:31:27,427][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:31:28,085][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:31:28,745][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:31:29,403][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:31:30,060][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:31:30,719][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:31:31,379][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:31:32,037][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:31:32,697][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:31:33,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:31:34,182][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:31:35,608][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:31:35,611][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:31:35,612][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:31:37,145][__main__][INFO] - Iteration 286 took 57s (16.97% Gen, 80.35% Train). Generation: 9s, Training: 46s. Estimated remaining time: 11h 28m 42s. Estimated total time: 15h 54m 14s. Time estimates for 10 more iterations: 9m 32s, 100 more iterations: 1h 35m 25s, 500 more iterations: 7h 57m 7s. [2026-03-25 18:31:37,147][__main__][INFO] - Starting iteration 286. [2026-03-25 18:31:37,151][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:31:37,152][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:31:43,250][__main__][INFO] - Number of regex retries in iteration 286: 0 [2026-03-25 18:31:43,250][__main__][INFO] - agents played in iteration 286 are Bob, Alice [2026-03-25 18:31:44,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:31:44,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:31:44,397][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:31:44,398][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:31:45,062][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:31:45,696][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:31:46,357][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:31:47,018][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:31:47,678][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:31:48,339][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:31:48,998][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:31:49,658][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:31:50,318][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:31:50,978][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:31:51,637][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:31:52,296][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:31:52,956][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:31:53,618][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:31:54,280][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:31:54,939][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:31:55,599][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:31:56,257][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:31:56,917][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:31:57,576][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:31:58,237][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:31:58,896][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:31:59,556][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:32:00,217][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:32:00,877][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:32:01,536][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:32:02,195][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:32:02,854][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:32:03,514][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:32:04,174][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:32:04,834][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:32:05,494][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:32:06,155][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:32:06,815][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:32:07,481][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:32:08,142][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:32:08,804][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:32:09,465][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:32:10,125][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:32:10,784][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:32:11,445][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:32:12,106][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:32:12,766][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:32:13,426][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:32:14,086][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:32:14,747][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:32:15,406][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:32:16,066][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:32:17,048][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:32:17,707][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:32:18,365][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:32:19,024][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:32:19,683][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:32:20,342][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:32:21,000][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:32:21,660][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:32:22,318][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:32:22,977][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:32:23,636][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:32:24,295][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:32:24,954][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:32:25,612][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:32:26,271][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:32:26,930][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:32:27,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:32:28,485][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:32:30,519][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:32:30,521][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:32:30,523][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:32:31,887][__main__][INFO] - Iteration 287 took 54s (11.14% Gen, 86.36% Train). Generation: 6s, Training: 47s. Estimated remaining time: 10h 45m 51s. Estimated total time: 15h 12m 17s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 13s, 500 more iterations: 7h 36m 8s. [2026-03-25 18:32:31,889][__main__][INFO] - Starting iteration 287. [2026-03-25 18:32:31,893][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:32:31,894][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:32:37,682][__main__][INFO] - Number of regex retries in iteration 287: 0 [2026-03-25 18:32:37,684][__main__][INFO] - agents played in iteration 287 are Bob, Alice [2026-03-25 18:32:38,161][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:32:38,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:32:38,224][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:32:38,225][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:32:38,901][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:32:39,521][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:32:40,171][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:32:40,831][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:32:41,491][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:32:42,151][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:32:42,811][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:32:43,470][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:32:44,130][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:32:44,790][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:32:45,450][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:32:46,114][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:32:46,772][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:32:47,431][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:32:48,090][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:32:48,748][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:32:49,407][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:32:50,066][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:32:50,727][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:32:51,388][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:32:52,047][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:32:52,712][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:32:53,372][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:32:54,029][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:32:54,692][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:32:55,351][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:32:56,010][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:32:56,669][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:32:57,329][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:32:57,987][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:32:58,647][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:32:59,305][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:32:59,965][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:33:00,624][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:33:01,283][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:33:01,943][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:33:02,603][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:33:03,264][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:33:03,924][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:33:04,583][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:33:05,243][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:33:05,908][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:33:06,568][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:33:07,228][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:33:07,888][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:33:08,547][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:33:09,207][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:33:09,868][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:33:10,854][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:33:11,513][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:33:12,171][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:33:12,830][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:33:13,488][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:33:14,147][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:33:14,806][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:33:15,465][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:33:16,124][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:33:16,782][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:33:17,440][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:33:18,099][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:33:18,758][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:33:19,418][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:33:20,077][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:33:20,735][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:33:21,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:33:22,178][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:33:23,847][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:33:23,850][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:33:23,851][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:33:25,279][__main__][INFO] - Iteration 288 took 53s (10.85% Gen, 86.47% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 22m 27s. Estimated total time: 14h 49m 47s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 58s, 500 more iterations: 7h 24m 53s. [2026-03-25 18:33:25,281][__main__][INFO] - Starting iteration 288. [2026-03-25 18:33:25,285][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:33:25,286][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:33:30,354][__main__][INFO] - Number of regex retries in iteration 288: 0 [2026-03-25 18:33:30,355][__main__][INFO] - agents played in iteration 288 are Bob, Alice [2026-03-25 18:33:30,938][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:33:31,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:33:31,000][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:33:31,001][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:33:31,839][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:33:32,444][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:33:33,105][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:33:33,763][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:33:34,424][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:33:35,084][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:33:35,743][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:33:36,402][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:33:37,062][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:33:37,721][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:33:38,380][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:33:39,041][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:33:39,701][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:33:40,361][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:33:41,020][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:33:41,679][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:33:42,341][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:33:43,003][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:33:43,664][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:33:44,324][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:33:44,984][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:33:45,644][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:33:46,305][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:33:46,964][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:33:47,626][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:33:48,285][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:33:48,946][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:33:49,605][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:33:50,265][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:33:50,925][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:33:51,584][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:33:52,244][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:33:52,904][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:33:53,564][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:33:54,223][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:33:54,881][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:33:55,542][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:33:56,201][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:33:56,859][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:33:57,518][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:33:58,178][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:33:58,840][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:33:59,500][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:34:00,159][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:34:00,818][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:34:01,478][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:34:02,137][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:34:02,797][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:34:03,801][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:34:04,461][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:34:05,121][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:34:05,781][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:34:06,440][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:34:07,099][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:34:07,758][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:34:08,418][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:34:09,078][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:34:09,738][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:34:10,397][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:34:11,057][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:34:11,716][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:34:12,375][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:34:13,036][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:34:13,697][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:34:14,358][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:34:15,247][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:34:16,734][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:34:16,738][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:34:16,739][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:34:18,166][__main__][INFO] - Iteration 289 took 52s (9.59% Gen, 87.71% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 13m 10s. Estimated total time: 14h 41m 23s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 8s, 500 more iterations: 7h 20m 41s. [2026-03-25 18:34:18,168][__main__][INFO] - Starting iteration 289. [2026-03-25 18:34:18,172][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:34:18,172][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:34:23,247][__main__][INFO] - Number of regex retries in iteration 289: 0 [2026-03-25 18:34:23,248][__main__][INFO] - agents played in iteration 289 are Bob, Alice [2026-03-25 18:34:23,759][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:34:23,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:34:23,823][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:34:23,823][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:34:24,683][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:34:25,298][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:34:25,962][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:34:26,622][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:34:27,282][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:34:27,942][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:34:28,602][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:34:29,263][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:34:29,922][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:34:30,582][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:34:31,241][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:34:31,902][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:34:32,559][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:34:33,222][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:34:33,883][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:34:34,543][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:34:35,204][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:34:35,864][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:34:36,524][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:34:37,184][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:34:37,844][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:34:38,503][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:34:39,162][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:34:39,822][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:34:40,481][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:34:41,141][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:34:41,802][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:34:42,461][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:34:43,120][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:34:43,780][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:34:44,439][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:34:45,099][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:34:45,761][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:34:46,421][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:34:47,082][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:34:47,741][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:34:48,401][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:34:49,060][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:34:49,719][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:34:50,379][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:34:51,039][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:34:51,698][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:34:52,357][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:34:53,018][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:34:53,677][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:34:54,335][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:34:54,994][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:34:55,653][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:34:56,645][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:34:57,307][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:34:57,966][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:34:58,625][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:34:59,283][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:34:59,941][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:35:00,600][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:35:01,259][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:35:01,917][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:35:02,576][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:35:03,233][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:35:03,892][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:35:06,060][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:35:06,719][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:35:07,377][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:35:08,034][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:35:08,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:35:09,510][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 18:35:10,931][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:35:10,934][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:35:10,935][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:35:12,473][__main__][INFO] - Iteration 290 took 54s (9.35% Gen, 87.82% Train). Generation: 5s, Training: 47s. Estimated remaining time: 10h 35m 55s. Estimated total time: 15h 5m 2s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 30s, 500 more iterations: 7h 32m 31s. [2026-03-25 18:35:12,475][__main__][INFO] - Starting iteration 290. [2026-03-25 18:35:12,480][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:35:12,480][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:35:17,591][__main__][INFO] - Number of regex retries in iteration 290: 0 [2026-03-25 18:35:17,591][__main__][INFO] - agents played in iteration 290 are Bob, Alice [2026-03-25 18:35:18,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:35:18,657][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:35:18,658][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:35:18,658][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:35:19,375][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:35:19,994][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:35:20,652][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:35:21,310][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:35:21,968][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:35:22,627][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:35:23,285][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:35:23,942][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:35:24,600][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:35:25,257][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:35:25,915][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:35:26,573][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:35:27,230][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:35:27,888][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:35:28,547][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:35:29,204][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:35:29,862][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:35:30,521][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:35:31,180][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:35:31,839][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:35:32,496][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:35:33,154][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:35:33,813][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:35:34,470][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:35:35,135][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:35:35,794][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:35:36,452][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:35:37,109][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:35:37,764][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:35:38,422][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:35:39,080][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:35:39,738][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:35:40,396][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:35:41,055][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:35:41,714][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:35:42,373][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:35:43,032][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:35:43,691][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:35:44,350][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:35:45,008][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:35:45,666][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:35:46,325][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:35:46,984][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:35:47,644][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:35:48,303][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:35:48,962][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:35:49,621][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:35:50,279][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:35:51,272][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:35:51,931][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:35:52,590][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:35:53,250][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:35:53,908][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:35:54,567][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:35:55,227][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:35:55,885][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:35:56,543][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:35:57,202][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:35:57,860][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:35:58,518][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:35:59,177][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:35:59,835][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:36:00,495][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:36:01,154][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:36:01,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:36:02,649][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:36:04,009][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:36:04,012][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:36:04,013][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:36:05,552][__main__][INFO] - Iteration 291 took 53s (9.63% Gen, 87.47% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 14m 33s. Estimated total time: 14h 44m 34s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 27s, 500 more iterations: 7h 22m 17s. [2026-03-25 18:36:05,554][__main__][INFO] - Starting iteration 291. [2026-03-25 18:36:05,559][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:36:05,559][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:36:14,254][__main__][INFO] - Number of regex retries in iteration 291: 0 [2026-03-25 18:36:14,256][__main__][INFO] - agents played in iteration 291 are Bob, Alice [2026-03-25 18:36:15,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:36:15,198][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:36:15,199][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:36:15,199][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:36:15,943][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:36:16,553][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:36:17,212][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:36:17,870][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:36:18,532][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:36:19,195][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:36:19,852][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:36:20,511][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:36:21,171][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:36:21,832][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:36:22,495][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:36:23,153][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:36:23,811][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:36:24,471][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:36:25,132][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:36:25,791][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:36:26,450][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:36:27,110][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:36:27,769][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:36:28,428][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:36:29,088][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:36:29,750][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:36:30,410][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:36:31,069][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:36:31,729][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:36:32,389][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:36:33,047][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:36:33,707][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:36:34,367][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:36:35,026][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:36:35,684][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:36:36,343][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:36:37,002][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:36:37,661][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:36:38,319][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:36:38,978][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:36:39,637][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:36:40,296][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:36:40,955][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:36:41,614][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:36:42,273][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:36:42,932][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:36:43,597][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:36:44,256][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:36:44,915][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:36:45,574][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:36:46,233][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:36:46,892][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:36:47,888][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:36:48,547][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:36:49,205][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:36:49,863][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:36:50,522][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:36:51,181][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:36:51,839][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:36:52,497][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:36:53,155][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:36:53,815][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:36:54,473][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:36:55,135][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:36:55,793][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:36:56,451][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:36:57,109][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:36:57,768][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:36:58,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:36:59,285][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:37:00,716][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:37:00,719][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:37:00,720][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:37:02,213][__main__][INFO] - Iteration 292 took 56s (15.35% Gen, 82.01% Train). Generation: 8s, Training: 46s. Estimated remaining time: 11h 13m 19s. Estimated total time: 15h 44m 16s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 25s, 500 more iterations: 7h 52m 8s. [2026-03-25 18:37:02,216][__main__][INFO] - Starting iteration 292. [2026-03-25 18:37:02,221][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:37:02,222][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:37:09,929][__main__][INFO] - Number of regex retries in iteration 292: 0 [2026-03-25 18:37:09,930][__main__][INFO] - agents played in iteration 292 are Bob, Alice [2026-03-25 18:37:10,516][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:37:10,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:37:10,579][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:37:10,579][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:37:11,254][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:37:11,865][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:37:12,528][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:37:13,188][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:37:13,847][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:37:14,508][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:37:15,166][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:37:15,826][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:37:16,484][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:37:17,144][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:37:17,802][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:37:18,461][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:37:19,119][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:37:19,778][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:37:20,437][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:37:21,097][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:37:21,756][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:37:22,415][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:37:23,075][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:37:23,734][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:37:24,394][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:37:25,053][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:37:25,712][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:37:26,370][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:37:27,030][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:37:27,689][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:37:28,348][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:37:29,008][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:37:29,667][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:37:30,326][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:37:30,985][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:37:31,644][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:37:32,303][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:37:32,961][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:37:33,620][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:37:34,279][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:37:34,938][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:37:35,597][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:37:36,256][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:37:36,915][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:37:37,574][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:37:38,233][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:37:38,892][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:37:39,551][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:37:40,210][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:37:40,869][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:37:41,527][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:37:42,186][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:37:43,178][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:37:43,837][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:37:44,495][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:37:45,153][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:37:45,811][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:37:46,469][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:37:47,130][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:37:47,788][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:37:48,446][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:37:49,104][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:37:49,762][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:37:50,420][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:37:51,078][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:37:51,737][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:37:52,396][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:37:53,054][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:37:53,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:37:54,629][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:37:55,983][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:37:55,986][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:37:55,988][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:37:57,605][__main__][INFO] - Iteration 293 took 55s (13.92% Gen, 83.16% Train). Generation: 7s, Training: 46s. Estimated remaining time: 10h 51m 14s. Estimated total time: 15h 23m 7s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 18s, 500 more iterations: 7h 41m 33s. [2026-03-25 18:37:57,608][__main__][INFO] - Starting iteration 293. [2026-03-25 18:37:57,612][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:37:57,612][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:38:03,348][__main__][INFO] - Number of regex retries in iteration 293: 0 [2026-03-25 18:38:03,349][__main__][INFO] - agents played in iteration 293 are Bob, Alice [2026-03-25 18:38:03,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:38:03,884][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:38:03,885][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:38:03,885][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:38:04,804][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:38:05,415][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:38:06,074][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:38:06,735][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:38:07,395][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:38:08,055][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:38:08,713][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:38:09,371][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:38:10,030][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:38:10,689][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:38:11,347][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:38:12,006][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:38:12,664][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:38:13,323][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:38:13,981][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:38:14,638][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:38:15,297][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:38:15,956][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:38:16,615][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:38:17,273][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:38:17,932][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:38:18,591][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:38:19,249][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:38:19,907][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:38:20,565][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:38:21,222][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:38:21,880][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:38:22,538][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:38:23,197][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:38:23,856][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:38:24,514][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:38:25,172][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:38:25,830][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:38:26,489][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:38:27,147][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:38:27,806][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:38:28,465][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:38:29,124][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:38:29,781][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:38:30,439][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:38:31,097][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:38:31,755][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:38:32,412][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:38:33,070][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:38:33,727][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:38:34,386][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:38:35,044][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:38:35,702][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:38:36,700][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:38:37,359][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:38:38,017][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:38:38,675][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:38:39,334][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:38:39,993][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:38:40,652][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:38:41,311][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:38:41,970][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:38:42,628][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:38:43,287][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:38:43,946][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:38:44,604][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:38:45,263][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:38:45,922][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:38:46,580][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:38:47,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:38:48,020][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:38:49,360][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:38:49,363][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:38:49,364][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:38:50,796][__main__][INFO] - Iteration 294 took 53s (10.79% Gen, 86.52% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 13m 40s. Estimated total time: 14h 46m 25s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 38s, 500 more iterations: 7h 23m 12s. [2026-03-25 18:38:50,798][__main__][INFO] - Starting iteration 294. [2026-03-25 18:38:50,802][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:38:50,802][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:38:59,791][__main__][INFO] - Number of regex retries in iteration 294: 0 [2026-03-25 18:38:59,792][__main__][INFO] - agents played in iteration 294 are Bob, Alice [2026-03-25 18:39:00,853][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:39:00,915][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:39:00,915][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:39:00,916][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:39:01,793][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:39:02,402][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:39:03,062][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:39:03,725][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:39:04,383][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:39:05,044][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:39:05,701][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:39:06,359][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:39:07,019][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:39:07,679][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:39:08,338][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:39:08,997][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:39:09,656][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:39:10,314][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:39:10,974][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:39:11,631][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:39:12,288][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:39:12,947][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:39:13,610][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:39:14,268][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:39:14,928][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:39:15,588][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:39:16,245][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:39:16,908][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:39:17,568][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:39:18,227][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:39:18,887][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:39:19,547][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:39:20,206][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:39:20,863][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:39:21,523][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:39:22,182][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:39:22,841][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:39:23,499][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:39:24,157][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:39:24,815][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:39:25,477][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:39:26,136][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:39:26,795][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:39:27,453][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:39:28,111][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:39:28,771][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:39:29,430][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:39:30,090][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:39:30,749][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:39:31,406][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:39:32,064][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:39:32,723][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:39:33,714][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:39:34,374][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:39:35,032][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:39:35,693][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:39:36,351][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:39:37,009][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:39:37,668][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:39:38,328][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:39:38,987][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:39:39,646][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:39:40,304][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:39:40,962][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:39:41,620][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:39:42,278][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:39:42,937][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:39:43,595][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:39:44,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:39:45,036][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:39:46,447][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:39:46,450][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:39:46,451][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:39:47,909][__main__][INFO] - Iteration 295 took 57s (15.74% Gen, 81.70% Train). Generation: 8s, Training: 46s. Estimated remaining time: 11h 18m 5s. Estimated total time: 15h 51m 48s. Time estimates for 10 more iterations: 9m 31s, 100 more iterations: 1h 35m 10s, 500 more iterations: 7h 55m 54s. [2026-03-25 18:39:47,911][__main__][INFO] - Starting iteration 295. [2026-03-25 18:39:47,915][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:39:47,916][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:39:57,693][__main__][INFO] - Number of regex retries in iteration 295: 0 [2026-03-25 18:39:57,694][__main__][INFO] - agents played in iteration 295 are Bob, Alice [2026-03-25 18:39:58,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:39:58,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:39:58,362][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:39:58,363][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:39:59,095][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:39:59,713][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:40:00,372][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:40:01,031][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:40:01,690][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:40:02,349][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:40:03,010][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:40:03,668][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:40:04,330][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:40:04,989][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:40:05,649][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:40:06,309][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:40:06,968][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:40:07,629][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:40:08,288][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:40:08,948][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:40:09,607][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:40:10,267][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:40:10,927][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:40:11,586][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:40:12,245][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:40:12,904][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:40:13,563][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:40:14,228][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:40:14,888][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:40:15,548][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:40:16,210][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:40:16,869][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:40:17,528][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:40:18,189][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:40:18,848][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:40:19,508][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:40:20,168][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:40:20,827][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:40:21,487][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:40:22,147][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:40:22,807][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:40:23,466][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:40:24,126][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:40:24,786][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:40:25,447][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:40:26,107][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:40:26,765][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:40:27,424][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:40:28,084][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:40:28,752][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:40:29,412][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:40:30,072][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:40:31,063][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:40:31,722][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:40:32,383][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:40:33,042][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:40:33,701][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:40:34,360][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:40:35,019][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:40:35,678][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:40:36,337][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:40:36,996][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:40:37,656][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:40:38,317][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:40:38,977][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:40:39,636][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:40:40,294][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:40:40,953][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:40:41,612][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:40:42,414][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:40:43,841][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:40:43,844][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:40:43,845][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:40:45,399][__main__][INFO] - Iteration 296 took 57s (17.01% Gen, 80.28% Train). Generation: 9s, Training: 46s. Estimated remaining time: 11h 23m 25s. Estimated total time: 15h 58m 5s. Time estimates for 10 more iterations: 9m 34s, 100 more iterations: 1h 35m 48s, 500 more iterations: 7h 59m 2s. [2026-03-25 18:40:45,401][__main__][INFO] - Starting iteration 296. [2026-03-25 18:40:45,405][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:40:45,405][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:40:50,460][__main__][INFO] - Number of regex retries in iteration 296: 0 [2026-03-25 18:40:50,461][__main__][INFO] - agents played in iteration 296 are Bob, Alice [2026-03-25 18:40:51,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:40:51,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:40:51,120][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:40:51,121][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:40:51,929][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:40:52,551][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:40:53,210][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:40:53,872][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:40:54,534][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:40:55,195][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:40:55,856][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:40:56,515][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:40:57,175][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:40:57,835][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:40:58,495][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:40:59,156][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:40:59,816][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:41:00,477][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:41:01,138][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:41:01,798][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:41:02,458][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:41:03,120][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:41:03,780][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:41:04,440][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:41:05,100][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:41:05,761][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:41:06,420][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:41:07,080][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:41:07,740][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:41:08,401][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:41:09,062][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:41:09,721][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:41:10,380][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:41:11,042][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:41:11,702][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:41:12,360][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:41:13,019][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:41:13,678][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:41:14,339][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:41:14,999][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:41:15,659][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:41:16,318][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:41:16,978][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:41:17,638][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:41:18,297][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:41:18,956][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:41:19,616][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:41:20,276][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:41:20,936][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:41:21,596][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:41:22,256][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:41:22,915][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:41:23,896][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:41:24,555][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:41:25,216][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:41:25,876][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:41:26,536][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:41:27,195][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:41:27,854][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:41:28,512][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:41:29,172][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:41:29,832][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:41:30,490][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:41:31,148][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:41:31,806][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:41:32,464][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:41:33,123][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:41:33,781][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:41:34,440][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:41:35,184][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:41:36,778][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:41:36,781][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:41:36,782][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:41:38,282][__main__][INFO] - Iteration 297 took 52s (9.56% Gen, 87.60% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 5m 46s. Estimated total time: 14h 41m 19s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 7s, 500 more iterations: 7h 20m 39s. [2026-03-25 18:41:38,285][__main__][INFO] - Starting iteration 297. [2026-03-25 18:41:38,288][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:41:38,289][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:41:43,483][__main__][INFO] - Number of regex retries in iteration 297: 0 [2026-03-25 18:41:43,484][__main__][INFO] - agents played in iteration 297 are Bob, Alice [2026-03-25 18:41:43,961][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:41:44,022][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:41:44,023][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:41:44,023][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:41:44,915][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:41:45,595][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:41:46,256][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:41:46,915][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:41:47,575][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:41:48,235][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:41:48,894][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:41:49,554][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:41:50,216][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:41:50,877][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:41:51,536][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:41:52,196][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:41:52,856][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:41:53,515][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:41:54,174][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:41:54,835][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:41:55,496][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:41:56,156][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:41:56,816][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:41:57,478][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:41:58,141][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:41:58,799][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:41:59,459][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:42:00,119][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:42:00,778][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:42:01,439][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:42:02,100][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:42:02,760][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:42:03,420][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:42:04,084][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:42:04,746][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:42:05,406][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:42:06,066][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:42:06,726][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:42:07,385][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:42:08,045][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:42:08,705][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:42:09,365][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:42:10,025][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:42:10,685][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:42:11,346][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:42:12,007][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:42:12,668][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:42:13,328][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:42:13,989][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:42:14,648][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:42:15,310][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:42:15,970][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:42:16,959][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:42:17,618][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:42:18,280][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:42:18,940][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:42:19,599][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:42:20,260][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:42:20,920][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:42:21,579][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:42:22,240][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:42:22,899][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:42:23,560][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:42:24,217][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:42:24,879][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:42:25,538][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:42:26,199][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:42:26,859][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:42:27,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:42:28,277][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:42:29,653][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:42:29,656][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:42:29,657][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:42:31,039][__main__][INFO] - Iteration 298 took 52s (9.85% Gen, 87.53% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 2m 46s. Estimated total time: 14h 39m 12s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 55s, 500 more iterations: 7h 19m 36s. [2026-03-25 18:42:31,041][__main__][INFO] - Starting iteration 298. [2026-03-25 18:42:31,049][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:42:31,049][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:42:36,217][__main__][INFO] - Number of regex retries in iteration 298: 0 [2026-03-25 18:42:36,218][__main__][INFO] - agents played in iteration 298 are Bob, Alice [2026-03-25 18:42:36,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:42:36,860][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:42:36,860][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:42:36,861][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:42:37,680][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:42:38,291][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:42:38,956][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:42:39,618][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:42:40,280][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:42:40,940][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:42:41,599][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:42:42,260][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:42:42,921][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:42:43,581][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:42:44,241][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:42:44,902][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:42:45,562][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:42:46,223][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:42:46,884][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:42:47,546][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:42:48,207][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:42:48,866][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:42:49,526][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:42:50,187][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:42:50,847][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:42:51,506][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:42:52,166][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:42:52,826][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:42:53,486][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:42:54,144][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:42:54,804][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:42:55,463][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:42:56,122][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:42:56,781][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:42:57,440][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:42:58,101][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:42:58,761][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:42:59,422][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:43:00,082][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:43:00,742][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:43:01,401][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:43:02,061][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:43:02,720][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:43:03,380][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:43:04,039][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:43:04,698][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:43:05,359][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:43:06,018][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:43:06,677][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:43:07,336][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:43:07,996][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:43:08,655][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:43:09,647][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:43:10,306][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:43:10,965][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:43:11,624][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:43:12,282][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:43:12,941][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:43:13,600][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:43:14,259][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:43:14,918][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:43:15,576][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:43:16,233][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:43:16,892][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:43:17,551][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:43:18,209][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:43:18,867][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:43:19,525][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:43:20,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:43:20,965][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:43:22,451][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:43:22,454][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:43:22,455][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:43:23,843][__main__][INFO] - Iteration 299 took 52s (9.79% Gen, 87.58% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 2m 37s. Estimated total time: 14h 39m 56s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 59s, 500 more iterations: 7h 19m 58s. [2026-03-25 18:43:23,845][__main__][INFO] - Starting iteration 299. [2026-03-25 18:43:23,849][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:43:23,849][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:43:30,391][__main__][INFO] - Number of regex retries in iteration 299: 0 [2026-03-25 18:43:30,392][__main__][INFO] - agents played in iteration 299 are Bob, Alice [2026-03-25 18:43:31,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:43:31,317][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:43:31,318][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:43:31,318][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:43:32,096][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:43:32,717][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:43:33,381][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:43:34,041][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:43:34,701][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:43:35,362][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:43:36,023][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:43:36,684][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:43:37,343][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:43:38,002][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:43:38,663][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:43:39,324][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:43:39,984][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:43:40,645][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:43:41,304][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:43:41,964][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:43:42,625][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:43:43,284][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:43:43,943][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:43:44,602][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:43:45,261][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:43:45,921][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:43:46,580][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:43:47,239][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:43:47,900][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:43:48,559][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:43:49,218][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:43:49,878][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:43:50,537][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:43:51,197][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:43:51,855][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:43:52,514][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:43:53,174][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:43:53,835][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:43:54,494][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:43:55,153][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:43:55,813][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:43:56,473][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:43:57,132][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:43:57,790][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:43:58,450][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:43:59,110][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:43:59,770][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:44:00,429][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:44:01,088][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:44:01,747][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:44:02,406][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:44:03,066][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:44:04,057][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:44:04,716][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:44:05,375][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:44:06,034][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:44:06,692][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:44:07,350][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:44:08,009][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:44:08,666][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:44:09,324][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:44:09,984][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:44:10,642][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:44:11,300][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:44:11,958][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:44:12,616][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:44:13,275][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:44:13,933][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:44:14,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:44:15,345][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:44:16,729][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:44:16,732][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:44:16,733][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:44:18,543][__main__][INFO] - Iteration 300 took 54s (11.96% Gen, 84.72% Train). Generation: 6s, Training: 46s. Estimated remaining time: 10h 33m 22s. Estimated total time: 15h 11m 35s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 9s, 500 more iterations: 7h 35m 47s. [2026-03-25 18:44:18,545][__main__][INFO] - Starting iteration 300. [2026-03-25 18:44:18,550][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:44:18,551][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:44:23,959][__main__][INFO] - Number of regex retries in iteration 300: 0 [2026-03-25 18:44:23,959][__main__][INFO] - agents played in iteration 300 are Bob, Alice [2026-03-25 18:44:24,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:44:24,601][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:44:24,601][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:44:24,602][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:44:25,399][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:44:26,007][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:44:26,665][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:44:27,324][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:44:27,982][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:44:28,642][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:44:29,300][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:44:29,958][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:44:30,617][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:44:31,276][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:44:31,935][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:44:32,593][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:44:33,251][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:44:33,911][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:44:34,568][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:44:35,227][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:44:35,886][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:44:36,544][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:44:37,203][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:44:37,862][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:44:38,520][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:44:39,180][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:44:39,838][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:44:40,496][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:44:41,153][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:44:41,811][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:44:42,470][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:44:43,128][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:44:43,787][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:44:44,445][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:44:45,105][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:44:45,764][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:44:46,423][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:44:47,082][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:44:47,741][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:44:48,400][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:44:49,058][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:44:49,716][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:44:50,374][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:44:51,034][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:44:51,693][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:44:52,352][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:44:53,010][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:44:53,671][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:44:54,337][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:44:54,998][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:44:55,656][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:44:56,315][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:44:57,305][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:44:57,965][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:44:58,624][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:44:59,282][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:44:59,942][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:45:00,600][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:45:01,259][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:45:01,917][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:45:02,576][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:45:03,234][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:45:03,893][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:45:04,551][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:45:05,209][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:45:05,867][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:45:06,526][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:45:07,185][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:45:07,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:45:08,670][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:45:10,015][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:45:10,018][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:45:10,019][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:45:12,785][__main__][INFO] - Iteration 301 took 54s (9.97% Gen, 84.92% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 24m 49s. Estimated total time: 15h 3m 56s. Time estimates for 10 more iterations: 9m 2s, 100 more iterations: 1h 30m 23s, 500 more iterations: 7h 31m 58s. [2026-03-25 18:45:12,787][__main__][INFO] - Starting iteration 301. [2026-03-25 18:45:12,791][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:45:12,791][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:45:18,000][__main__][INFO] - Number of regex retries in iteration 301: 0 [2026-03-25 18:45:18,001][__main__][INFO] - agents played in iteration 301 are Bob, Alice [2026-03-25 18:45:18,473][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:45:18,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:45:18,536][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:45:18,536][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:45:19,271][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:45:19,880][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:45:20,540][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:45:21,200][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:45:21,859][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:45:22,518][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:45:23,177][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:45:23,836][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:45:24,500][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:45:25,158][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:45:27,203][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:45:27,861][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:45:28,519][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:45:29,179][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:45:29,836][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:45:30,495][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:45:31,153][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:45:31,813][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:45:32,471][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:45:33,129][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:45:33,789][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:45:34,446][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:45:35,102][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:45:35,760][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:45:36,419][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:45:37,078][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:45:37,737][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:45:38,396][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:45:39,054][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:45:39,712][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:45:40,371][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:45:41,030][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:45:41,689][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:45:42,347][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:45:43,006][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:45:43,664][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:45:44,322][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:45:44,980][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:45:45,640][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:45:46,300][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:45:46,958][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:45:47,615][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:45:48,280][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:45:48,932][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:45:49,589][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:45:50,248][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:45:50,907][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:45:51,565][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:45:52,559][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:45:53,218][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:45:53,876][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:45:54,536][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:45:55,193][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:45:55,851][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:45:56,510][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:45:57,171][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:45:57,827][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:45:58,485][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:45:59,144][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:45:59,802][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:46:00,460][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:46:01,119][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:46:01,778][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:46:02,435][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:46:03,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:46:03,882][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 18:46:05,288][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:46:05,292][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:46:05,293][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:46:06,805][__main__][INFO] - Iteration 302 took 54s (9.64% Gen, 87.55% Train). Generation: 5s, Training: 47s. Estimated remaining time: 10h 20m 14s. Estimated total time: 15h 0m 15s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 1s, 500 more iterations: 7h 30m 7s. [2026-03-25 18:46:06,808][__main__][INFO] - Starting iteration 302. [2026-03-25 18:46:06,812][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:46:06,813][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:46:11,886][__main__][INFO] - Number of regex retries in iteration 302: 0 [2026-03-25 18:46:11,887][__main__][INFO] - agents played in iteration 302 are Bob, Alice [2026-03-25 18:46:12,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:46:12,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:46:12,533][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:46:12,533][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:46:13,210][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:46:13,839][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:46:14,500][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:46:15,160][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:46:15,818][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:46:16,477][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:46:17,136][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:46:17,798][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:46:18,459][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:46:19,119][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:46:19,780][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:46:20,439][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:46:21,098][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:46:21,758][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:46:22,416][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:46:23,075][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:46:23,734][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:46:24,393][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:46:25,052][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:46:25,711][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:46:26,370][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:46:27,029][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:46:27,688][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:46:28,346][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:46:29,007][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:46:29,665][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:46:30,324][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:46:30,983][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:46:31,642][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:46:32,301][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:46:32,959][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:46:33,620][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:46:34,279][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:46:34,938][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:46:35,596][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:46:36,256][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:46:36,915][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:46:37,574][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:46:38,234][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:46:38,893][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:46:39,552][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:46:40,212][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:46:40,871][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:46:41,531][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:46:42,191][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:46:42,850][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:46:43,511][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:46:44,171][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:46:45,156][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:46:45,815][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:46:46,473][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:46:47,131][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:46:47,789][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:46:48,447][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:46:49,106][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:46:49,764][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:46:50,424][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:46:51,083][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:46:51,742][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:46:52,403][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:46:53,061][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:46:53,721][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:46:54,380][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:46:55,040][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:46:55,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:46:56,642][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:46:58,119][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:46:58,122][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:46:58,124][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:46:59,436][__main__][INFO] - Iteration 303 took 52s (9.64% Gen, 87.86% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 56m 11s. Estimated total time: 14h 37m 5s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 42s, 500 more iterations: 7h 18m 32s. [2026-03-25 18:46:59,438][__main__][INFO] - Starting iteration 303. [2026-03-25 18:46:59,442][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:46:59,443][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:47:04,799][__main__][INFO] - Number of regex retries in iteration 303: 0 [2026-03-25 18:47:04,800][__main__][INFO] - agents played in iteration 303 are Bob, Alice [2026-03-25 18:47:05,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:47:05,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:47:05,883][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:47:05,883][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:47:06,754][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:47:07,363][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:47:08,024][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:47:08,683][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:47:09,341][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:47:10,002][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:47:10,660][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:47:11,321][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:47:11,980][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:47:12,639][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:47:13,298][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:47:13,960][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:47:14,620][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:47:15,280][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:47:15,940][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:47:16,600][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:47:17,259][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:47:17,918][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:47:18,577][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:47:19,237][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:47:19,897][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:47:20,557][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:47:21,216][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:47:21,876][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:47:22,536][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:47:23,195][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:47:23,855][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:47:24,513][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:47:25,173][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:47:25,832][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:47:26,491][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:47:27,149][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:47:27,809][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:47:28,468][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:47:29,129][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:47:29,788][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:47:30,449][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:47:31,108][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:47:31,767][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:47:32,426][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:47:33,085][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:47:33,744][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:47:34,404][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:47:35,064][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:47:35,724][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:47:36,382][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:47:37,041][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:47:37,701][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:47:38,687][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:47:39,347][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:47:40,006][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:47:40,665][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:47:41,324][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:47:41,982][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:47:42,641][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:47:43,300][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:47:43,960][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:47:44,619][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:47:45,278][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:47:45,937][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:47:46,595][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:47:47,254][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:47:47,912][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:47:48,571][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:47:49,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:47:50,035][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:47:51,410][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:47:51,413][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:47:51,414][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:47:52,830][__main__][INFO] - Iteration 304 took 53s (10.03% Gen, 87.31% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 8m 2s. Estimated total time: 14h 49m 49s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 58s, 500 more iterations: 7h 24m 54s. [2026-03-25 18:47:52,833][__main__][INFO] - Starting iteration 304. [2026-03-25 18:47:52,837][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:47:52,837][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:47:59,477][__main__][INFO] - Number of regex retries in iteration 304: 0 [2026-03-25 18:47:59,479][__main__][INFO] - agents played in iteration 304 are Bob, Alice [2026-03-25 18:48:00,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:48:00,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:48:00,442][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:48:00,442][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:48:01,105][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:48:01,726][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:48:02,371][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:48:03,030][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:48:03,689][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:48:04,350][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:48:05,011][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:48:05,672][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:48:06,332][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:48:06,992][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:48:07,654][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:48:08,314][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:48:08,973][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:48:09,634][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:48:10,293][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:48:10,955][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:48:11,615][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:48:12,274][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:48:12,933][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:48:13,592][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:48:14,252][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:48:14,910][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:48:15,571][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:48:16,231][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:48:16,890][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:48:17,550][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:48:18,210][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:48:18,869][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:48:19,529][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:48:20,192][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:48:20,854][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:48:21,512][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:48:22,174][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:48:22,833][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:48:23,493][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:48:24,152][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:48:24,811][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:48:25,471][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:48:26,130][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:48:26,790][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:48:27,450][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:48:28,110][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:48:28,770][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:48:29,431][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:48:30,091][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:48:30,751][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:48:31,411][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:48:32,071][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:48:33,073][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:48:33,732][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:48:34,392][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:48:35,052][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:48:35,710][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:48:36,368][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:48:37,028][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:48:37,687][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:48:38,346][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:48:39,004][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:48:39,665][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:48:40,322][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:48:40,982][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:48:41,641][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:48:42,300][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:48:42,959][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:48:43,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:48:44,357][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:48:45,732][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:48:45,735][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:48:45,737][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:48:47,136][__main__][INFO] - Iteration 305 took 54s (12.23% Gen, 85.19% Train). Generation: 6s, Training: 46s. Estimated remaining time: 10h 22m 19s. Estimated total time: 15h 5m 1s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 30s, 500 more iterations: 7h 32m 30s. [2026-03-25 18:48:47,139][__main__][INFO] - Starting iteration 305. [2026-03-25 18:48:47,142][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:48:47,143][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:48:52,392][__main__][INFO] - Number of regex retries in iteration 305: 0 [2026-03-25 18:48:52,393][__main__][INFO] - agents played in iteration 305 are Bob, Alice [2026-03-25 18:48:53,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:48:53,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:48:53,073][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:48:53,073][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:48:53,847][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:48:54,463][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:48:55,123][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:48:55,783][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:48:56,443][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:48:57,102][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:48:57,761][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:48:58,422][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:48:59,081][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:48:59,741][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:49:00,402][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:49:01,062][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:49:01,722][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:49:02,383][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:49:03,044][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:49:03,704][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:49:04,364][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:49:05,026][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:49:05,686][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:49:06,345][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:49:07,004][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:49:07,664][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:49:08,325][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:49:08,985][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:49:09,645][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:49:10,305][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:49:10,965][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:49:11,624][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:49:12,283][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:49:12,942][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:49:13,601][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:49:14,261][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:49:14,922][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:49:15,581][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:49:16,242][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:49:16,901][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:49:17,560][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:49:18,220][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:49:18,880][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:49:19,539][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:49:20,198][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:49:20,858][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:49:21,517][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:49:22,177][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:49:22,835][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:49:23,495][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:49:24,155][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:49:24,815][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:49:25,802][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:49:26,460][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:49:27,119][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:49:27,777][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:49:28,436][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:49:29,095][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:49:29,752][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:49:30,410][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:49:31,068][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:49:31,727][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:49:32,384][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:49:33,042][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:49:33,700][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:49:34,359][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:49:35,017][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:49:35,675][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:49:36,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:49:37,088][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:49:38,464][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:49:38,467][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:49:38,468][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:49:39,842][__main__][INFO] - Iteration 306 took 52s (9.96% Gen, 87.43% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 54m 47s. Estimated total time: 14h 38m 21s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 50s, 500 more iterations: 7h 19m 10s. [2026-03-25 18:49:39,844][__main__][INFO] - Starting iteration 306. [2026-03-25 18:49:39,848][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:49:39,849][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:49:47,895][__main__][INFO] - Number of regex retries in iteration 306: 0 [2026-03-25 18:49:47,897][__main__][INFO] - agents played in iteration 306 are Bob, Alice [2026-03-25 18:49:48,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:49:48,491][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:49:48,492][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:49:48,492][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:49:49,351][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:49:49,971][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:49:50,631][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:49:51,290][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:49:51,949][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:49:52,608][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:49:53,267][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:49:53,926][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:49:54,586][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:49:55,247][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:49:55,907][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:49:56,567][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:49:57,230][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:49:57,891][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:49:58,550][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:49:59,212][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:49:59,871][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:50:00,530][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:50:01,191][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:50:01,850][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:50:02,509][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:50:03,168][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:50:03,829][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:50:04,488][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:50:05,148][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:50:05,806][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:50:06,467][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:50:07,128][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:50:07,788][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:50:08,448][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:50:09,108][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:50:09,768][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:50:10,430][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:50:11,090][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:50:11,749][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:50:12,408][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:50:13,068][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:50:13,728][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:50:14,388][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:50:15,048][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:50:15,708][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:50:16,369][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:50:17,030][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:50:17,690][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:50:18,350][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:50:19,010][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:50:19,670][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:50:20,329][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:50:21,311][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:50:21,969][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:50:22,629][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:50:23,287][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:50:23,946][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:50:24,605][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:50:25,263][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:50:25,921][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:50:26,581][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:50:27,241][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:50:27,900][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:50:28,559][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:50:29,218][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:50:29,877][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:50:30,536][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:50:31,195][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:50:31,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:50:32,657][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:50:33,984][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:50:33,986][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:50:33,988][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:50:35,495][__main__][INFO] - Iteration 307 took 55s (14.46% Gen, 82.82% Train). Generation: 8s, Training: 46s. Estimated remaining time: 10h 42m 58s. Estimated total time: 15h 27m 28s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 44s, 500 more iterations: 7h 43m 44s. [2026-03-25 18:50:35,502][__main__][INFO] - Starting iteration 307. [2026-03-25 18:50:35,507][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:50:35,507][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:50:41,535][__main__][INFO] - Number of regex retries in iteration 307: 0 [2026-03-25 18:50:41,536][__main__][INFO] - agents played in iteration 307 are Bob, Alice [2026-03-25 18:50:42,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:50:42,105][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:50:42,106][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:50:42,106][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:50:42,922][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:50:43,537][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:50:44,199][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:50:44,858][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:50:45,518][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:50:46,176][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:50:46,835][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:50:47,495][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:50:48,153][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:50:48,814][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:50:49,472][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:50:50,133][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:50:50,793][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:50:51,452][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:50:52,111][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:50:52,770][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:50:53,429][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:50:54,089][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:50:54,747][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:50:55,406][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:50:56,065][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:50:56,723][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:50:57,383][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:50:58,042][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:50:58,702][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:50:59,363][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:51:00,022][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:51:00,681][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:51:01,340][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:51:01,998][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:51:02,657][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:51:03,316][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:51:03,975][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:51:04,635][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:51:05,294][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:51:05,952][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:51:06,611][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:51:07,270][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:51:07,929][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:51:08,587][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:51:09,246][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:51:09,905][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:51:10,564][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:51:11,222][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:51:11,881][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:51:12,539][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:51:13,199][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:51:13,858][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:51:14,844][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:51:15,502][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:51:16,159][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:51:16,818][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:51:17,476][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:51:18,134][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:51:18,794][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:51:19,450][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:51:20,107][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:51:20,765][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:51:21,423][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:51:22,082][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:51:22,740][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:51:23,398][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:51:24,057][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:51:24,715][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:51:25,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:51:26,206][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:51:28,176][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:51:28,179][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:51:28,180][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:51:29,614][__main__][INFO] - Iteration 308 took 54s (11.14% Gen, 86.20% Train). Generation: 6s, Training: 46s. Estimated remaining time: 10h 16m 25s. Estimated total time: 15h 1m 50s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 11s, 500 more iterations: 7h 30m 55s. [2026-03-25 18:51:29,617][__main__][INFO] - Starting iteration 308. [2026-03-25 18:51:29,622][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:51:29,622][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:51:39,295][mllm.models.large_language_model_local][WARNING] - Response user Last round, the other agent played . 
did not match regex: (|), retry 1/1 [2026-03-25 18:51:45,118][__main__][INFO] - Number of regex retries in iteration 308: 1 [2026-03-25 18:51:45,119][__main__][INFO] - agents played in iteration 308 are Bob, Alice [2026-03-25 18:51:45,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:51:45,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:51:45,837][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:51:45,838][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:51:46,520][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:51:47,142][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:51:47,805][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:51:48,465][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:51:49,122][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:51:49,781][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:51:50,440][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:51:51,101][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:51:51,761][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:51:52,419][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:51:53,079][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:51:53,741][mllm.training.trainer_common][INFO] - 
Processing mini-batch 44 of 256 [2026-03-25 18:51:54,401][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:51:55,062][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:51:55,723][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:51:56,383][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:51:57,044][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:51:57,703][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:51:58,363][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:51:59,027][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:51:59,690][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:52:00,348][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:52:01,008][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:52:01,669][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:52:02,329][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:52:02,989][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:52:03,648][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:52:04,308][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:52:04,967][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:52:05,667][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:52:06,327][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:52:06,986][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 
18:52:07,645][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:52:08,305][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:52:08,964][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:52:09,625][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:52:10,285][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:52:10,944][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:52:11,604][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:52:12,264][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:52:12,922][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:52:13,581][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:52:14,240][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:52:14,899][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:52:15,558][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:52:16,217][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:52:16,877][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:52:17,536][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:52:18,528][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:52:19,189][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:52:19,848][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:52:20,508][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:52:21,167][mllm.training.trainer_common][INFO] - 
Processing mini-batch 208 of 256 [2026-03-25 18:52:21,827][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:52:22,486][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:52:23,144][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:52:23,801][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:52:24,461][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:52:25,120][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:52:25,780][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:52:26,439][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:52:27,098][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:52:27,757][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:52:28,417][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:52:29,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:52:29,905][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:52:31,268][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:52:31,272][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:52:31,273][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:52:32,777][__main__][INFO] - Iteration 309 took 1m 3s (24.54% Gen, 73.08% Train). Generation: 15s, Training: 46s. Estimated remaining time: 12h 46m 10s. Estimated total time: 17h 32m 38s. Time estimates for 10 more iterations: 10m 31s, 100 more iterations: 1h 45m 15s, 500 more iterations: 8h 46m 19s. [2026-03-25 18:52:32,780][__main__][INFO] - Starting iteration 309. [2026-03-25 18:52:32,784][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:52:32,785][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:52:37,823][__main__][INFO] - Number of regex retries in iteration 309: 0 [2026-03-25 18:52:37,824][__main__][INFO] - agents played in iteration 309 are Bob, Alice [2026-03-25 18:52:38,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:52:38,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:52:38,429][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:52:38,429][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:52:39,107][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:52:39,721][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:52:40,382][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:52:41,041][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:52:41,701][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:52:42,360][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:52:43,021][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:52:43,680][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:52:44,341][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:52:44,999][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:52:45,658][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:52:46,317][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:52:46,976][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:52:47,635][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:52:48,295][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:52:48,954][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:52:49,614][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:52:50,274][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:52:50,933][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:52:51,592][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:52:52,252][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:52:52,912][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:52:53,571][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:52:54,230][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:52:54,889][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:52:55,549][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:52:56,209][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:52:56,868][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:52:57,528][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:52:58,188][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:52:58,849][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:52:59,509][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:53:00,168][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:53:00,827][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:53:01,487][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:53:02,147][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:53:02,805][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:53:03,465][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:53:04,125][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:53:04,785][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:53:05,445][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:53:06,104][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:53:06,765][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:53:07,424][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:53:08,084][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:53:08,743][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:53:09,403][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:53:10,063][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:53:11,039][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:53:11,697][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:53:12,355][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:53:13,016][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:53:13,673][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:53:14,333][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:53:14,993][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:53:15,653][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:53:16,312][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:53:16,971][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:53:17,630][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:53:18,290][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:53:18,949][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:53:19,607][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:53:20,265][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:53:20,924][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:53:21,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:53:22,306][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:53:23,738][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:53:23,741][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:53:23,743][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:53:25,179][__main__][INFO] - Iteration 310 took 52s (9.62% Gen, 87.63% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 45m 57s. Estimated total time: 14h 33m 17s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 19s, 500 more iterations: 7h 16m 38s. [2026-03-25 18:53:25,181][__main__][INFO] - Starting iteration 310. [2026-03-25 18:53:25,185][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:53:25,186][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:53:37,906][__main__][INFO] - Number of regex retries in iteration 310: 0 [2026-03-25 18:53:37,908][__main__][INFO] - agents played in iteration 310 are Bob, Alice [2026-03-25 18:53:38,387][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:53:38,449][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:53:38,449][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:53:38,450][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:53:39,246][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:53:39,858][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:53:40,519][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:53:41,178][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:53:41,838][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:53:42,498][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:53:43,157][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:53:43,816][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:53:44,476][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:53:45,136][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:53:45,796][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:53:46,456][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:53:47,114][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:53:47,773][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:53:48,432][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:53:49,090][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:53:49,750][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:53:50,408][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:53:51,067][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:53:51,727][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:53:52,387][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:53:53,046][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:53:53,706][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:53:54,364][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:53:55,022][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:53:55,680][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:53:56,340][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:53:56,998][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:53:57,658][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:53:58,317][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:53:58,977][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:53:59,636][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:54:00,295][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:54:00,954][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:54:01,614][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:54:02,272][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:54:02,931][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:54:03,589][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:54:04,249][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:54:04,907][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:54:05,567][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:54:06,225][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:54:06,884][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:54:07,543][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:54:08,201][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:54:08,861][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:54:09,519][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:54:10,178][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:54:11,159][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:54:11,819][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:54:12,476][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:54:13,134][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:54:13,793][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:54:14,451][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:54:15,110][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:54:15,768][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:54:16,428][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:54:17,086][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:54:17,744][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:54:18,403][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:54:19,061][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:54:19,718][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:54:20,376][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:54:21,034][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:54:21,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:54:22,466][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:54:23,848][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:54:23,851][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:54:23,852][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:54:25,273][__main__][INFO] - Iteration 311 took 1m 0s (21.17% Gen, 76.46% Train). Generation: 12s, Training: 45s. Estimated remaining time: 11h 53m 10s. Estimated total time: 16h 41m 30s. Time estimates for 10 more iterations: 10m 0s, 100 more iterations: 1h 40m 9s, 500 more iterations: 8h 20m 45s. [2026-03-25 18:54:25,276][__main__][INFO] - Starting iteration 311. [2026-03-25 18:54:25,283][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:54:25,284][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:54:34,942][__main__][INFO] - Number of regex retries in iteration 311: 0 [2026-03-25 18:54:34,943][__main__][INFO] - agents played in iteration 311 are Bob, Alice [2026-03-25 18:54:36,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:54:36,181][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:54:36,182][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:54:36,182][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:54:36,855][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:54:37,474][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:54:38,134][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:54:38,795][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:54:39,457][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:54:40,116][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:54:40,774][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:54:41,434][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:54:42,094][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:54:42,756][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:54:43,416][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:54:44,075][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:54:44,735][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:54:45,394][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:54:46,053][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:54:46,713][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:54:47,372][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:54:48,032][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:54:48,692][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:54:49,351][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:54:50,011][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:54:50,671][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:54:51,331][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:54:51,991][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:54:52,650][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:54:53,310][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:54:53,970][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:54:54,631][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:54:55,292][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:54:55,952][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:54:56,611][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:54:57,270][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:54:57,929][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:54:58,588][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:54:59,249][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:54:59,910][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:55:00,570][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:55:01,229][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:55:01,889][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:55:02,548][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:55:03,209][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:55:03,869][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:55:04,529][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:55:05,190][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:55:05,850][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:55:06,510][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:55:07,170][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:55:07,830][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:55:08,815][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:55:09,476][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:55:10,135][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:55:10,794][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:55:11,452][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:55:12,110][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:55:12,770][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:55:13,431][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:55:14,088][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:55:14,749][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:55:15,409][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:55:16,069][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:55:16,728][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:55:17,386][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:55:18,046][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:55:18,704][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:55:19,363][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:55:20,073][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:55:21,422][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:55:21,425][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:55:21,427][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:55:22,831][__main__][INFO] - Iteration 312 took 57s (16.78% Gen, 80.77% Train). Generation: 9s, Training: 46s. Estimated remaining time: 11h 9m 53s. Estimated total time: 15h 59m 10s. Time estimates for 10 more iterations: 9m 35s, 100 more iterations: 1h 35m 55s, 500 more iterations: 7h 59m 35s. [2026-03-25 18:55:22,833][__main__][INFO] - Starting iteration 312. [2026-03-25 18:55:22,838][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:55:22,838][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:55:28,260][__main__][INFO] - Number of regex retries in iteration 312: 0 [2026-03-25 18:55:28,262][__main__][INFO] - agents played in iteration 312 are Bob, Alice [2026-03-25 18:55:28,851][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:55:28,914][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:55:28,914][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:55:28,915][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:55:29,616][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:55:30,224][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:55:30,885][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:55:31,543][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:55:32,204][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:55:32,864][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:55:33,523][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:55:34,184][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:55:34,843][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:55:35,503][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:55:36,162][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:55:36,822][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:55:37,483][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:55:38,144][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:55:38,805][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:55:39,465][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:55:40,125][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:55:40,786][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:55:41,445][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:55:42,106][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:55:42,769][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:55:43,428][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:55:44,088][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:55:44,751][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:55:45,412][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:55:46,075][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:55:46,735][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:55:47,396][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:55:48,058][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:55:48,719][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:55:49,381][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:55:50,042][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:55:50,701][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:55:51,362][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:55:52,023][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:55:52,682][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:55:53,343][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:55:54,003][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:55:54,666][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:55:55,329][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:55:55,988][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:55:56,648][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:55:57,308][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:55:57,968][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:55:58,631][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:55:59,293][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:55:59,954][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:56:00,614][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:56:01,606][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:56:02,268][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:56:02,927][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:56:03,587][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:56:04,248][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:56:04,907][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:56:05,567][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:56:06,226][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:56:06,884][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:56:07,543][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:56:08,201][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:56:08,859][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:56:09,518][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:56:10,177][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:56:10,836][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:56:11,494][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:56:12,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:56:12,930][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:56:14,318][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:56:14,322][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:56:14,323][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:56:25,657][__main__][INFO] - Iteration 313 took 1m 2s (8.63% Gen, 73.32% Train). Generation: 5s, Training: 46s. Estimated remaining time: 12h 36m 41s. Estimated total time: 17h 27m 1s. Time estimates for 10 more iterations: 10m 28s, 100 more iterations: 1h 44m 42s, 500 more iterations: 8h 43m 30s. [2026-03-25 18:56:25,660][__main__][INFO] - Starting iteration 313. [2026-03-25 18:56:25,665][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:56:25,666][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:56:32,169][__main__][INFO] - Number of regex retries in iteration 313: 0 [2026-03-25 18:56:32,170][__main__][INFO] - agents played in iteration 313 are Bob, Alice [2026-03-25 18:56:33,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:56:33,340][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:56:33,340][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:56:33,341][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:56:34,130][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:56:34,743][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:56:35,404][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:56:36,064][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:56:36,724][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:56:37,385][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:56:38,046][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:56:38,705][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:56:39,365][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:56:40,026][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:56:40,687][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:56:41,347][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:56:42,007][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:56:42,665][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:56:43,326][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:56:43,987][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:56:44,648][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:56:45,308][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:56:45,968][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:56:46,628][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:56:47,289][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:56:47,949][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:56:48,610][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:56:49,272][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:56:49,931][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:56:50,591][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:56:51,252][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:56:51,911][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:56:52,571][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:56:53,231][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:56:53,891][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:56:54,551][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:56:55,212][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:56:55,871][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:56:56,530][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:56:57,190][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:56:57,851][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:56:58,511][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:56:59,170][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:56:59,829][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:57:00,488][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:57:01,149][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:57:01,809][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:57:02,468][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:57:03,129][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:57:03,789][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:57:04,450][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:57:05,109][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:57:06,102][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:57:06,761][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:57:07,421][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:57:08,080][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:57:08,740][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:57:09,400][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:57:10,063][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:57:10,721][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:57:11,382][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:57:12,042][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:57:12,703][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:57:13,363][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:57:14,023][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:57:14,683][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:57:15,345][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:57:16,006][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:57:16,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:57:17,477][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:57:18,853][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:57:18,856][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:57:18,857][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:57:20,262][__main__][INFO] - Iteration 314 took 54s (11.91% Gen, 85.51% Train). Generation: 6s, Training: 46s. Estimated remaining time: 10h 18m 45s. Estimated total time: 15h 10m 0s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 0s, 500 more iterations: 7h 35m 0s. [2026-03-25 18:57:20,264][__main__][INFO] - Starting iteration 314. [2026-03-25 18:57:20,268][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:57:20,269][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:57:28,320][__main__][INFO] - Number of regex retries in iteration 314: 0 [2026-03-25 18:57:28,322][__main__][INFO] - agents played in iteration 314 are Bob, Alice [2026-03-25 18:57:28,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:57:28,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:57:28,864][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:57:28,864][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:57:29,735][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:57:30,355][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:57:31,018][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:57:31,678][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:57:32,338][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:57:32,999][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:57:33,659][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:57:34,319][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:57:34,981][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:57:35,641][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:57:36,301][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:57:36,960][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:57:37,620][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:57:38,281][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:57:38,942][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:57:39,601][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:57:40,261][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:57:40,920][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:57:41,580][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:57:42,240][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:57:42,901][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:57:43,561][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:57:44,220][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:57:44,879][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:57:45,541][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:57:46,201][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:57:46,860][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:57:47,519][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:57:48,179][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:57:48,839][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:57:49,499][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:57:50,160][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:57:50,820][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:57:51,480][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:57:52,141][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:57:52,801][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:57:53,460][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:57:54,119][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:57:54,779][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:57:55,439][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:57:56,099][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:57:56,758][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:57:57,417][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:57:58,077][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:57:58,738][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:57:59,399][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:58:00,060][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:58:00,721][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:58:01,709][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:58:02,368][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:58:03,027][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:58:03,686][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:58:04,345][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:58:05,004][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:58:05,663][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:58:06,322][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:58:06,980][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:58:07,639][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:58:08,297][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:58:09,102][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:58:09,762][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:58:10,421][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:58:11,079][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:58:11,738][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:58:12,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:58:13,201][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:58:14,570][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:58:14,573][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:58:14,574][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:58:15,982][__main__][INFO] - Iteration 315 took 55s (14.45% Gen, 83.02% Train). Generation: 8s, Training: 46s. Estimated remaining time: 10h 36m 24s. Estimated total time: 15h 28m 35s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 51s, 500 more iterations: 7h 44m 17s. [2026-03-25 18:58:15,984][__main__][INFO] - Starting iteration 315. [2026-03-25 18:58:15,987][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:58:15,988][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:58:20,982][__main__][INFO] - Number of regex retries in iteration 315: 0 [2026-03-25 18:58:20,983][__main__][INFO] - agents played in iteration 315 are Bob, Alice [2026-03-25 18:58:21,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:58:21,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:58:21,643][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:58:21,644][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:58:22,402][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:58:23,031][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:58:23,691][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:58:24,352][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:58:25,013][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:58:25,674][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:58:26,332][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:58:26,991][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:58:27,652][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:58:28,312][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:58:28,974][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:58:29,634][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:58:30,293][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:58:30,954][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:58:31,613][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:58:32,274][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:58:32,933][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:58:33,593][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:58:34,253][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:58:34,913][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:58:35,572][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:58:36,232][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:58:36,892][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:58:37,552][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:58:38,211][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:58:38,871][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:58:39,531][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:58:40,191][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:58:40,851][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:58:41,510][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:58:42,169][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:58:42,829][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:58:43,489][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:58:44,148][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:58:44,807][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:58:45,467][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:58:46,127][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:58:46,786][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:58:47,447][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:58:48,107][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:58:48,766][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:58:49,426][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:58:50,087][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:58:50,747][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:58:51,407][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:58:52,065][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:58:52,726][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:58:53,385][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:58:54,372][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:58:55,035][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:58:55,693][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:58:56,354][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:58:57,013][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:58:57,674][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:58:58,332][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:58:58,991][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:58:59,651][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:59:00,312][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:59:00,970][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:59:01,629][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:59:02,288][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:59:02,947][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:59:03,606][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:59:04,267][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:59:04,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:59:05,678][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:59:07,068][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:59:07,071][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:59:07,072][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:59:08,485][__main__][INFO] - Iteration 316 took 52s (9.51% Gen, 87.79% Train). Generation: 4s, Training: 46s. Estimated remaining time: 9h 41m 55s. Estimated total time: 14h 34m 58s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 29s, 500 more iterations: 7h 17m 29s. [2026-03-25 18:59:08,487][__main__][INFO] - Starting iteration 316. [2026-03-25 18:59:08,490][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 18:59:08,491][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:59:13,528][__main__][INFO] - Number of regex retries in iteration 316: 0 [2026-03-25 18:59:13,528][__main__][INFO] - agents played in iteration 316 are Bob, Alice [2026-03-25 18:59:14,015][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:59:14,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 18:59:14,077][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:59:14,078][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:59:14,932][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:59:15,547][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:59:16,208][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:59:16,868][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:59:17,527][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:59:18,185][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:59:18,844][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:59:19,505][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:59:20,164][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:59:20,823][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:59:21,482][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:59:22,142][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:59:22,802][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:59:23,463][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:59:24,122][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:59:24,781][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:59:25,440][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:59:26,100][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:59:26,761][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:59:27,420][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:59:28,080][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:59:28,741][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:59:29,400][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:59:30,058][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:59:30,718][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:59:31,378][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:59:32,037][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:59:32,697][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:59:33,358][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:59:34,018][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:59:34,679][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:59:35,339][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:59:36,004][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:59:36,662][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:59:37,321][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:59:37,980][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:59:38,639][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:59:39,299][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:59:39,959][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:59:40,619][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:59:41,279][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:59:41,938][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:59:42,598][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:59:43,257][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:59:43,918][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:59:44,579][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:59:45,239][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:59:45,900][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:59:46,898][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:59:47,558][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:59:48,217][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:59:48,876][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:59:49,538][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:59:50,198][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:59:50,857][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:59:51,516][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:59:52,176][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:59:52,835][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:59:53,494][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:59:54,153][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:59:54,813][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:59:55,473][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:59:56,131][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:59:56,789][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:59:57,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:59:58,265][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 18:59:59,695][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:59:59,698][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:59:59,699][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:00:01,154][__main__][INFO] - Iteration 317 took 52s (9.57% Gen, 87.67% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 43m 48s. Estimated total time: 14h 37m 44s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 46s, 500 more iterations: 7h 18m 52s. [2026-03-25 19:00:01,156][__main__][INFO] - Starting iteration 317. [2026-03-25 19:00:01,160][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:00:01,160][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:00:09,037][__main__][INFO] - Number of regex retries in iteration 317: 0 [2026-03-25 19:00:09,038][__main__][INFO] - agents played in iteration 317 are Bob, Alice [2026-03-25 19:00:10,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:00:10,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:00:10,200][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:00:10,200][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:00:10,878][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:00:11,497][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:00:12,157][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:00:12,816][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:00:13,478][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:00:14,138][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:00:14,797][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:00:15,457][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:00:16,116][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:00:16,777][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:00:17,438][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:00:18,097][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:00:18,758][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:00:19,418][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:00:20,077][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:00:20,737][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:00:21,397][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:00:22,057][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:00:22,718][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:00:23,379][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:00:24,038][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:00:25,003][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:00:25,662][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:00:26,321][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:00:26,979][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:00:27,637][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:00:28,297][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:00:28,956][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:00:29,614][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:00:30,272][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:00:30,931][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:00:31,589][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:00:32,248][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:00:32,906][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:00:33,565][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:00:34,222][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:00:34,880][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:00:35,539][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:00:36,198][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:00:36,856][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:00:37,515][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:00:38,173][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:00:38,831][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:00:39,489][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:00:40,148][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:00:40,809][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:00:41,468][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:00:42,128][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:00:43,140][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:00:43,799][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:00:44,458][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:00:45,118][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:00:45,776][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:00:46,435][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:00:47,095][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:00:47,754][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:00:48,414][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:00:49,073][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:00:49,732][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:00:50,391][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:00:51,050][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:00:51,708][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:00:52,367][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:00:53,026][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:00:53,684][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:00:54,463][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:00:55,926][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:00:55,929][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:00:55,930][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:00:57,354][__main__][INFO] - Iteration 318 took 56s (14.02% Gen, 83.45% Train). Generation: 7s, Training: 46s. Estimated remaining time: 10h 41m 44s. Estimated total time: 15h 36m 36s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 39s, 500 more iterations: 7h 48m 18s. [2026-03-25 19:00:57,357][__main__][INFO] - Starting iteration 318. [2026-03-25 19:00:57,360][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:00:57,361][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:01:02,399][__main__][INFO] - Number of regex retries in iteration 318: 0 [2026-03-25 19:01:02,400][__main__][INFO] - agents played in iteration 318 are Bob, Alice [2026-03-25 19:01:02,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:01:02,949][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:01:02,950][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:01:02,950][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:01:03,668][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:01:04,283][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:01:04,942][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:01:05,603][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:01:06,263][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:01:06,924][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:01:07,585][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:01:08,246][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:01:08,906][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:01:09,567][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:01:10,227][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:01:10,887][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:01:11,547][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:01:12,207][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:01:12,869][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:01:13,530][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:01:14,191][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:01:14,851][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:01:15,512][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:01:16,172][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:01:16,832][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:01:17,492][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:01:18,152][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:01:18,812][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:01:19,472][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:01:20,132][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:01:20,792][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:01:21,451][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:01:22,111][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:01:22,771][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:01:23,431][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:01:24,093][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:01:24,754][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:01:25,413][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:01:26,074][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:01:26,734][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:01:27,394][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:01:28,053][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:01:28,712][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:01:29,372][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:01:30,033][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:01:30,694][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:01:31,353][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:01:32,012][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:01:32,671][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:01:33,331][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:01:33,991][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:01:34,651][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:01:35,635][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:01:36,295][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:01:36,956][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:01:37,616][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:01:38,276][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:01:38,935][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:01:39,595][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:01:40,254][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:01:40,913][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:01:41,572][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:01:42,231][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:01:42,891][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:01:43,552][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:01:44,210][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:01:44,873][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:01:45,532][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:01:46,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:01:47,013][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:01:48,432][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:01:48,435][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:01:48,436][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:01:49,827][__main__][INFO] - Iteration 319 took 52s (9.60% Gen, 87.74% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 38m 43s. Estimated total time: 14h 34m 28s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 26s, 500 more iterations: 7h 17m 14s. [2026-03-25 19:01:50,225][__main__][INFO] - Starting iteration 319. [2026-03-25 19:01:50,229][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:01:50,229][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:01:55,590][__main__][INFO] - Number of regex retries in iteration 319: 0 [2026-03-25 19:01:55,591][__main__][INFO] - agents played in iteration 319 are Bob, Alice [2026-03-25 19:01:56,232][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:01:56,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:01:56,295][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:01:56,296][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:01:56,976][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:01:57,599][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:01:58,255][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:01:58,914][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:01:59,572][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:02:00,232][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:02:00,892][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:02:01,553][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:02:02,212][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:02:02,871][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:02:03,539][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:02:04,194][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:02:04,853][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:02:05,513][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:02:06,172][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:02:06,831][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:02:07,490][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:02:08,148][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:02:08,807][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:02:09,467][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:02:10,127][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:02:10,787][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:02:11,446][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:02:12,105][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:02:13,712][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:02:14,370][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:02:15,029][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:02:15,686][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:02:16,345][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:02:17,005][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:02:17,665][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:02:18,325][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:02:18,984][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:02:19,643][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:02:20,303][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:02:20,961][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:02:21,620][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:02:22,279][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:02:22,938][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:02:23,596][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:02:24,255][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:02:24,913][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:02:25,571][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:02:26,231][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:02:26,891][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:02:27,550][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:02:28,209][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:02:28,869][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:02:29,855][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:02:30,514][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:02:31,173][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:02:31,832][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:02:32,492][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:02:33,150][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:02:33,809][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:02:34,468][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:02:35,129][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:02:35,787][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:02:36,448][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:02:37,107][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:02:37,765][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:02:38,425][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:02:39,083][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:02:39,746][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:02:40,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:02:41,239][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 19:02:44,544][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:02:44,547][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:02:45,531][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:02:46,909][__main__][INFO] - Iteration 320 took 56s (9.46% Gen, 88.11% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 48m 0s. Estimated total time: 15h 44m 42s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 28s, 500 more iterations: 7h 52m 21s. [2026-03-25 19:02:46,913][__main__][INFO] - Starting iteration 320. [2026-03-25 19:02:46,918][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:02:46,918][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:02:52,935][__main__][INFO] - Number of regex retries in iteration 320: 0 [2026-03-25 19:02:52,936][__main__][INFO] - agents played in iteration 320 are Bob, Alice [2026-03-25 19:02:53,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:02:53,979][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:02:53,979][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:02:53,980][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:02:54,866][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:02:55,476][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:02:56,136][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:02:56,798][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:02:57,459][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:02:58,120][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:02:58,783][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:02:59,443][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:03:00,103][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:03:00,764][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:03:01,424][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:03:02,083][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:03:02,743][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:03:03,402][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:03:04,063][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:03:04,723][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:03:05,382][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:03:06,041][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:03:06,701][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:03:07,361][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:03:08,019][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:03:08,678][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:03:09,339][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:03:09,998][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:03:10,658][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:03:11,317][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:03:11,976][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:03:12,635][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:03:13,295][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:03:13,954][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:03:14,614][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:03:15,273][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:03:15,936][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:03:16,596][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:03:17,257][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:03:17,917][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:03:18,576][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:03:19,236][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:03:19,896][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:03:20,558][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:03:21,217][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:03:21,876][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:03:22,536][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:03:23,195][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:03:23,855][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:03:24,514][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:03:25,175][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:03:25,834][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:03:26,859][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:03:27,517][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:03:28,176][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:03:28,837][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:03:29,495][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:03:30,154][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:03:30,815][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:03:32,633][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:03:33,293][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:03:33,951][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:03:34,611][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:03:35,269][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:03:35,927][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:03:36,585][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:03:37,244][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:03:37,903][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:03:38,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:03:39,351][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 19:03:40,743][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:03:40,746][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:03:40,747][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:03:42,249][__main__][INFO] - Iteration 321 took 55s (10.88% Gen, 86.40% Train). Generation: 6s, Training: 47s. Estimated remaining time: 10h 24m 36s. Estimated total time: 15h 22m 13s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 13s, 500 more iterations: 7h 41m 6s. [2026-03-25 19:03:42,251][__main__][INFO] - Starting iteration 321. [2026-03-25 19:03:42,255][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:03:42,255][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:03:48,580][__main__][INFO] - Number of regex retries in iteration 321: 0 [2026-03-25 19:03:48,581][__main__][INFO] - agents played in iteration 321 are Bob, Alice [2026-03-25 19:03:49,052][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:03:49,113][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:03:49,114][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:03:49,114][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:03:49,847][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:03:50,473][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:03:51,131][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:03:51,789][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:03:52,449][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:03:53,109][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:03:53,768][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:03:54,428][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:03:55,088][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:03:55,747][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:03:56,405][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:03:57,067][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:03:57,726][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:03:58,386][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:03:59,046][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:03:59,704][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:04:00,364][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:04:01,023][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:04:01,683][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:04:02,342][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:04:03,001][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:04:03,661][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:04:04,320][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:04:04,979][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:04:05,638][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:04:06,298][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:04:06,957][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:04:07,615][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:04:08,274][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:04:08,933][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:04:09,593][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:04:10,254][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:04:10,915][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:04:11,577][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:04:12,236][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:04:12,895][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:04:13,554][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:04:14,214][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:04:14,873][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:04:15,532][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:04:16,192][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:04:16,851][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:04:17,510][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:04:18,170][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:04:18,829][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:04:19,493][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:04:20,151][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:04:20,810][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:04:21,821][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:04:22,481][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:04:23,140][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:04:23,799][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:04:24,457][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:04:25,118][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:04:25,777][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:04:26,436][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:04:27,095][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:04:27,760][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:04:28,417][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:04:29,076][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:04:29,736][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:04:30,395][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:04:31,054][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:04:31,713][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:04:32,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:04:33,158][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:04:34,598][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:04:34,600][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:04:34,602][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:04:37,662][__main__][INFO] - Iteration 322 took 55s (11.42% Gen, 83.06% Train). Generation: 6s, Training: 46s. Estimated remaining time: 10h 24m 56s. Estimated total time: 15h 23m 28s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 20s, 500 more iterations: 7h 41m 44s. [2026-03-25 19:04:37,664][__main__][INFO] - Starting iteration 322. [2026-03-25 19:04:37,669][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:04:37,669][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:04:43,336][__main__][INFO] - Number of regex retries in iteration 322: 0 [2026-03-25 19:04:43,337][__main__][INFO] - agents played in iteration 322 are Bob, Alice [2026-03-25 19:04:43,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:04:43,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:04:43,985][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:04:43,986][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:04:44,709][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:04:45,324][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:04:45,984][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:04:46,643][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:04:47,303][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:04:47,962][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:04:48,621][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:04:49,284][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:04:49,947][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:04:50,604][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:04:51,265][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:04:51,927][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:04:52,588][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:04:53,247][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:04:53,908][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:04:54,568][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:04:55,232][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:04:55,891][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:04:56,552][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:04:57,212][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:04:57,872][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:04:58,531][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:04:59,192][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:04:59,852][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:05:00,513][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:05:01,174][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:05:01,834][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:05:02,765][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:05:03,423][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:05:04,082][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:05:08,164][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:05:08,824][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:05:09,483][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:05:10,144][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:05:10,803][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:05:11,462][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:05:12,121][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:05:12,781][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:05:13,440][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:05:14,100][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:05:14,760][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:05:15,420][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:05:16,080][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:05:16,740][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:05:17,401][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:05:18,060][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:05:18,720][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:05:19,378][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:05:20,357][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:05:21,016][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:05:21,675][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:05:22,336][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:05:22,996][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:05:23,655][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:05:24,313][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:05:24,973][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:05:25,632][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:05:26,290][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:05:26,949][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:05:27,607][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:05:28,266][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:05:28,925][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:05:29,584][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:05:30,242][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:05:30,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:05:31,722][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:47 [2026-03-25 19:05:33,152][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:05:33,155][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:05:33,157][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:05:34,588][__main__][INFO] - Iteration 323 took 56s (9.96% Gen, 87.52% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 49m 13s. Estimated total time: 15h 48m 42s. Time estimates for 10 more iterations: 9m 29s, 100 more iterations: 1h 34m 52s, 500 more iterations: 7h 54m 21s. [2026-03-25 19:05:34,591][__main__][INFO] - Starting iteration 323. [2026-03-25 19:05:34,595][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:05:34,596][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:05:39,990][__main__][INFO] - Number of regex retries in iteration 323: 0 [2026-03-25 19:05:39,991][__main__][INFO] - agents played in iteration 323 are Bob, Alice [2026-03-25 19:05:40,878][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:05:40,940][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:05:40,940][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:05:40,941][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:05:41,800][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:05:42,428][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:05:43,089][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:05:43,750][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:05:44,411][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:05:45,073][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:05:45,732][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:05:46,393][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:05:47,052][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:05:47,713][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:05:48,876][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:05:50,960][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:05:51,616][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:05:52,275][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:05:52,933][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:05:53,616][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:05:54,636][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:05:55,294][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:05:55,954][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:05:56,613][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:05:57,271][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:05:57,929][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:05:58,590][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:05:59,249][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:05:59,907][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:06:00,565][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:06:01,223][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:06:01,882][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:06:02,539][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:06:03,198][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:06:03,857][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:06:04,516][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:06:05,175][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:06:05,835][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:06:06,493][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:06:07,151][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:06:07,810][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:06:08,469][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:06:09,128][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:06:09,787][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:06:10,446][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:06:11,105][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:06:11,764][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:06:12,422][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:06:13,081][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:06:13,740][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:06:15,307][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:06:15,967][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:06:16,954][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:06:17,612][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:06:18,272][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:06:18,928][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:06:19,588][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:06:20,247][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:06:20,906][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:06:21,566][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:06:22,225][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:06:22,885][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:06:23,545][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:06:24,203][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:06:24,861][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:06:25,521][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:06:26,180][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:06:26,838][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:06:27,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:06:28,349][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:46 [2026-03-25 19:06:29,767][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:06:29,770][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:06:29,772][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:06:31,142][__main__][INFO] - Iteration 324 took 56s (9.54% Gen, 88.03% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 42m 3s. Estimated total time: 15h 42m 29s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 14s, 500 more iterations: 7h 51m 14s. [2026-03-25 19:06:31,144][__main__][INFO] - Starting iteration 324. [2026-03-25 19:06:31,148][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:06:31,148][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:06:36,227][__main__][INFO] - Number of regex retries in iteration 324: 0 [2026-03-25 19:06:36,228][__main__][INFO] - agents played in iteration 324 are Bob, Alice [2026-03-25 19:06:36,731][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:06:36,792][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:06:36,793][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:06:36,794][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:06:37,601][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:06:38,214][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:06:38,875][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:06:39,536][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:06:40,196][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:06:40,856][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:06:41,516][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:06:42,176][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:06:42,835][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:06:43,494][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:06:44,153][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:06:44,813][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:06:45,475][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:06:46,134][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:06:46,793][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:06:47,452][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:06:48,111][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:06:48,769][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:06:49,428][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:06:50,088][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:06:50,747][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:06:51,406][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:06:52,065][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:06:52,724][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:06:53,385][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:06:54,044][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:06:54,703][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:06:55,362][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:06:56,022][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:06:56,682][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:06:57,341][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:06:58,002][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:06:58,661][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:06:59,321][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:06:59,980][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:07:00,639][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:07:01,298][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:07:01,958][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:07:02,617][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:07:03,276][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:07:03,935][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:07:04,595][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:07:05,254][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:07:05,913][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:07:06,572][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:07:07,231][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:07:07,891][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:07:08,552][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:07:09,539][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:07:10,200][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:07:10,859][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:07:11,518][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:07:12,177][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:07:12,835][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:07:13,495][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:07:14,153][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:07:14,812][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:07:15,470][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:07:16,130][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:07:16,789][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:07:17,448][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:07:18,105][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:07:18,766][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:07:19,423][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:07:20,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:07:20,789][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:07:22,183][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:07:22,830][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:07:22,831][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:07:24,185][__main__][INFO] - Iteration 325 took 53s (9.58% Gen, 87.86% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 42m 40s. Estimated total time: 14h 43m 59s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 23s, 500 more iterations: 7h 21m 59s. [2026-03-25 19:07:24,188][__main__][INFO] - Starting iteration 325. [2026-03-25 19:07:24,192][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:07:24,192][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:07:29,462][__main__][INFO] - Number of regex retries in iteration 325: 0 [2026-03-25 19:07:29,463][__main__][INFO] - agents played in iteration 325 are Bob, Alice [2026-03-25 19:07:30,026][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:07:30,087][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:07:30,088][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:07:30,088][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:07:30,862][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:07:31,494][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:07:32,155][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:07:32,814][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:07:33,475][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:07:34,134][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:07:34,793][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:07:35,452][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:07:36,112][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:07:36,770][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:07:37,429][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:07:38,087][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:07:38,748][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:07:39,407][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:07:40,067][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:07:40,728][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:07:43,088][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:07:43,744][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:07:44,401][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:07:45,059][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:07:45,718][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:07:46,375][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:07:47,034][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:07:47,693][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:07:48,351][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:07:49,012][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:07:49,673][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:07:50,330][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:07:50,988][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:07:51,646][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:07:52,307][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:07:52,964][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:07:53,622][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:07:54,281][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:07:54,939][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:07:55,596][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:07:56,255][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:07:56,913][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:07:57,570][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:07:58,228][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:07:58,887][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:07:59,546][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:08:00,204][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:08:00,862][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:08:03,503][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:08:04,160][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:08:04,819][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:08:05,477][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:08:06,460][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:08:07,118][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:08:07,776][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:08:08,435][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:08:09,093][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:08:09,751][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:08:10,410][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:08:11,069][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:08:11,727][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:08:12,386][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:08:13,044][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:08:13,702][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:08:14,360][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:08:15,018][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:08:15,677][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:08:16,337][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:08:16,997][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:08:17,790][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:46 [2026-03-25 19:08:19,201][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:08:19,204][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:08:19,205][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:08:20,705][__main__][INFO] - Iteration 326 took 56s (9.33% Gen, 88.02% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 39m 39s. Estimated total time: 15h 41m 54s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 11s, 500 more iterations: 7h 50m 57s. [2026-03-25 19:08:20,707][__main__][INFO] - Starting iteration 326. [2026-03-25 19:08:20,711][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:08:20,711][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:08:26,552][__main__][INFO] - Number of regex retries in iteration 326: 0 [2026-03-25 19:08:26,556][__main__][INFO] - agents played in iteration 326 are Bob, Alice [2026-03-25 19:08:27,598][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:08:27,660][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:08:27,661][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:08:27,661][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:08:28,473][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:08:29,088][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:08:29,748][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:08:30,407][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:08:31,066][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:08:31,725][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:08:32,382][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:08:33,041][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:08:33,700][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:08:34,360][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:08:35,019][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:08:35,679][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:08:36,338][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:08:36,996][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:08:37,658][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:08:38,317][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:08:38,977][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:08:39,636][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:08:40,296][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:08:40,954][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:08:41,613][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:08:42,272][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:08:42,931][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:08:43,589][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:08:44,248][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:08:44,906][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:08:45,565][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:08:46,223][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:08:46,881][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:08:47,540][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:08:48,198][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:08:48,856][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:08:49,514][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:08:50,173][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:08:50,831][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:08:51,494][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:08:52,155][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:08:52,812][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:08:53,474][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:08:54,133][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:08:54,792][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:08:55,451][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:08:56,110][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:08:56,768][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:08:57,426][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:08:58,084][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:08:58,743][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:08:59,402][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:09:00,408][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:09:01,068][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:09:01,727][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:09:02,386][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:09:03,046][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:09:03,704][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:09:04,363][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:09:05,022][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:09:05,682][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:09:06,341][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:09:07,000][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:09:07,658][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:09:08,317][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:09:08,974][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:09:09,633][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:09:10,291][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:09:10,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:09:11,845][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:09:13,214][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:09:13,217][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:09:13,218][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:09:14,776][__main__][INFO] - Iteration 327 took 54s (10.81% Gen, 86.30% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 57m 57s. Estimated total time: 15h 1m 7s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 6s, 500 more iterations: 7h 30m 33s. [2026-03-25 19:09:14,780][__main__][INFO] - Starting iteration 327. [2026-03-25 19:09:14,787][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:09:14,788][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:09:20,640][__main__][INFO] - Number of regex retries in iteration 327: 0 [2026-03-25 19:09:20,641][__main__][INFO] - agents played in iteration 327 are Bob, Alice [2026-03-25 19:09:21,238][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:09:21,304][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:09:21,305][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:09:21,306][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:09:21,966][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:09:22,582][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:09:23,243][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:09:23,902][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:09:24,562][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:09:25,222][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:09:25,881][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:09:26,542][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:09:27,201][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:09:27,861][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:09:28,520][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:09:29,180][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:09:29,840][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:09:30,499][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:09:31,159][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:09:31,818][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:09:32,478][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:09:33,138][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:09:33,797][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:09:34,456][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:09:35,117][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:09:35,778][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:09:36,436][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:09:37,095][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:09:37,755][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:09:38,415][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:09:39,075][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:09:39,736][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:09:40,397][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:09:41,057][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:09:41,717][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:09:42,376][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:09:43,036][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:09:43,695][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:09:44,356][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:09:45,016][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:09:45,675][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:09:46,335][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:09:46,994][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:09:47,655][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:09:48,312][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:09:48,972][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:09:49,633][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:09:50,292][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:09:50,952][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:09:51,612][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:09:52,273][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:09:52,933][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:09:53,911][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:09:54,570][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:09:55,230][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:09:55,889][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:09:56,547][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:09:57,205][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:09:57,866][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:09:58,525][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:09:59,185][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:09:59,844][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:10:00,503][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:10:01,161][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:10:01,820][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:10:02,479][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:10:03,139][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:10:03,799][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:10:04,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:10:05,226][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:10:06,610][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:10:06,613][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:10:06,614][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:10:08,146][__main__][INFO] - Iteration 328 took 53s (10.97% Gen, 86.15% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 45m 17s. Estimated total time: 14h 49m 20s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 56s, 500 more iterations: 7h 24m 40s. [2026-03-25 19:10:08,149][__main__][INFO] - Starting iteration 328. [2026-03-25 19:10:08,153][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:10:08,153][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:10:14,049][__main__][INFO] - Number of regex retries in iteration 328: 0 [2026-03-25 19:10:14,051][__main__][INFO] - agents played in iteration 328 are Bob, Alice [2026-03-25 19:10:14,777][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:10:14,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:10:14,840][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:10:14,840][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:10:15,548][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:10:16,165][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:10:16,824][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:10:17,483][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:10:18,142][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:10:18,810][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:10:19,470][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:10:20,129][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:10:20,787][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:10:21,446][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:10:22,104][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:10:22,763][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:10:23,421][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:10:24,079][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:10:24,738][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:10:25,396][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:10:26,056][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:10:26,714][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:10:27,372][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:10:28,031][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:10:28,690][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:10:29,349][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:10:30,007][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:10:30,666][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:10:31,325][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:10:31,985][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:10:32,643][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:10:33,302][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:10:33,961][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:10:34,619][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:10:35,278][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:10:35,937][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:10:36,595][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:10:37,253][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:10:37,911][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:10:38,569][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:10:39,228][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:10:39,886][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:10:40,546][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:10:41,204][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:10:41,863][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:10:42,521][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:10:43,179][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:10:43,839][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:10:44,498][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:10:45,157][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:10:45,815][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:10:46,475][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:10:47,495][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:10:48,155][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:10:48,815][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:10:49,475][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:10:50,136][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:10:50,797][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:10:51,459][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:10:52,119][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:10:52,780][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:10:53,438][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:10:54,098][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:10:54,758][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:10:55,420][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:10:56,079][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:10:56,739][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:10:57,398][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:10:58,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:10:58,790][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:11:00,079][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:11:00,082][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:11:00,083][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:11:04,033][__main__][INFO] - Iteration 329 took 55s (10.55% Gen, 82.37% Train). Generation: 5s, Training: 46s. Estimated remaining time: 10h 26m 23s. Estimated total time: 15h 31m 22s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 8s, 500 more iterations: 7h 45m 41s. [2026-03-25 19:11:04,036][__main__][INFO] - Starting iteration 329. [2026-03-25 19:11:04,040][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:11:04,040][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:11:09,642][__main__][INFO] - Number of regex retries in iteration 329: 0 [2026-03-25 19:11:09,643][__main__][INFO] - agents played in iteration 329 are Bob, Alice [2026-03-25 19:11:10,714][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:11:10,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:11:10,776][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:11:10,776][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:11:11,538][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:11:12,147][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:11:12,808][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:11:13,468][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:11:14,127][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:11:14,786][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:11:15,445][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:11:16,104][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:11:16,765][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:11:17,424][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:11:18,085][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:11:18,744][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:11:19,403][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:11:20,062][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:11:20,721][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:11:21,381][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:11:22,040][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:11:22,699][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:11:23,359][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:11:24,018][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:11:24,678][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:11:25,338][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:11:25,997][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:11:26,655][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:11:27,314][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:11:27,972][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:11:28,632][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:11:29,292][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:11:29,951][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:11:30,611][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:11:31,271][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:11:31,932][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:11:32,592][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:11:33,252][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:11:33,912][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:11:34,570][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:11:35,229][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:11:35,888][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:11:36,553][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:11:37,211][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:11:37,871][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:11:38,530][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:11:39,190][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:11:39,850][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:11:40,509][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:11:41,168][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:11:41,828][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:11:42,490][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:11:43,479][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:11:44,138][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:11:44,797][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:11:45,458][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:11:46,117][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:11:46,775][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:11:47,434][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:11:48,094][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:11:48,753][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:11:49,414][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:11:50,070][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:11:50,729][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:11:51,389][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:11:52,047][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:11:52,707][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:11:53,365][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:11:54,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:11:54,958][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:11:56,367][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:11:56,370][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:11:56,401][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:11:58,106][__main__][INFO] - Iteration 330 took 54s (10.36% Gen, 86.48% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 55m 16s. Estimated total time: 15h 1m 8s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 6s, 500 more iterations: 7h 30m 34s. [2026-03-25 19:11:58,109][__main__][INFO] - Starting iteration 330. [2026-03-25 19:11:58,117][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:11:58,117][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:12:04,588][__main__][INFO] - Number of regex retries in iteration 330: 0 [2026-03-25 19:12:04,589][__main__][INFO] - agents played in iteration 330 are Bob, Alice [2026-03-25 19:12:05,359][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:12:05,427][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:12:05,428][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:12:05,429][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:12:06,151][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:12:06,778][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:12:07,440][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:12:08,102][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:12:08,764][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:12:09,425][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:12:10,085][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:12:10,742][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:12:11,402][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:12:12,062][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:12:12,727][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:12:13,388][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:12:14,050][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:12:14,712][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:12:15,373][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:12:16,036][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:12:16,697][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:12:17,358][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:12:18,019][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:12:18,680][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:12:19,342][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:12:20,003][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:12:20,667][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:12:21,326][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:12:21,986][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:12:22,646][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:12:23,308][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:12:23,969][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:12:24,629][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:12:25,290][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:12:25,951][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:12:26,615][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:12:27,276][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:12:27,938][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:12:28,598][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:12:29,258][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:12:29,918][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:12:30,578][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:12:31,239][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:12:31,901][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:12:32,559][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:12:33,219][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:12:33,879][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:12:34,538][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:12:35,200][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:12:35,857][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:12:36,517][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:12:37,177][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:12:38,174][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:12:38,833][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:12:39,493][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:12:40,152][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:12:40,812][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:12:41,469][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:12:42,129][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:12:42,788][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:12:43,446][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:12:44,107][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:12:44,765][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:12:45,425][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:12:46,085][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:12:46,744][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:12:47,407][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:12:48,067][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:12:48,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:12:49,623][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:12:51,051][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:12:51,055][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:12:51,056][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:12:52,453][__main__][INFO] - Iteration 331 took 54s (11.91% Gen, 85.51% Train). Generation: 6s, Training: 46s. Estimated remaining time: 9h 58m 51s. Estimated total time: 15h 5m 38s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 33s, 500 more iterations: 7h 32m 49s. [2026-03-25 19:12:52,455][__main__][INFO] - Starting iteration 331. [2026-03-25 19:12:52,461][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:12:52,462][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:12:58,091][__main__][INFO] - Number of regex retries in iteration 331: 0 [2026-03-25 19:12:58,092][__main__][INFO] - agents played in iteration 331 are Bob, Alice [2026-03-25 19:12:58,792][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:12:58,855][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:12:58,856][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:12:58,856][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:12:59,611][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:13:00,223][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:13:00,884][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:13:01,548][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:13:02,206][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:13:02,864][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:13:03,526][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:13:04,186][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:13:04,849][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:13:05,510][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:13:06,171][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:13:06,835][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:13:07,491][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:13:08,152][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:13:08,812][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:13:09,471][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:13:10,129][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:13:10,788][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:13:11,448][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:13:12,107][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:13:13,703][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:13:16,125][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:13:16,783][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:13:17,441][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:13:18,099][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:13:18,758][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:13:19,416][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:13:20,076][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:13:20,735][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:13:21,394][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:13:22,052][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:13:22,710][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:13:23,370][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:13:26,597][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:13:27,256][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:13:27,913][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:13:28,574][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:13:29,235][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:13:29,894][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:13:30,554][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:13:31,215][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:13:31,876][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:13:32,535][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:13:33,195][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:13:33,856][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:13:34,632][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:13:37,284][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:13:37,943][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:13:38,913][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:13:39,573][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:13:40,229][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:13:40,887][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:13:41,546][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:13:42,206][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:13:42,864][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:13:43,525][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:13:44,183][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:13:44,842][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:13:45,500][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:13:46,158][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:13:46,817][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:13:47,477][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:13:48,136][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:13:48,794][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:13:49,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:13:50,222][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:50 [2026-03-25 19:13:51,629][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:13:51,632][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:13:51,633][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:13:53,116][__main__][INFO] - Iteration 332 took 1m 0s (9.28% Gen, 88.27% Train). Generation: 5s, Training: 53s. Estimated remaining time: 11h 43m 8s. Estimated total time: 16h 50m 56s. Time estimates for 10 more iterations: 10m 6s, 100 more iterations: 1h 41m 5s, 500 more iterations: 8h 25m 28s. [2026-03-25 19:13:53,119][__main__][INFO] - Starting iteration 332. [2026-03-25 19:13:53,122][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:13:53,123][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:13:58,763][__main__][INFO] - Number of regex retries in iteration 332: 0 [2026-03-25 19:13:58,764][__main__][INFO] - agents played in iteration 332 are Bob, Alice [2026-03-25 19:14:00,194][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:14:00,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:14:00,256][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:14:00,257][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:14:01,158][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:14:01,771][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:14:02,432][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:14:03,092][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:14:03,752][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:14:04,414][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:14:05,073][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:14:05,733][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:14:08,344][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:14:09,003][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:14:09,663][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:14:10,323][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:14:10,982][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:14:11,642][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:14:12,301][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:14:12,960][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:14:13,619][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:14:14,277][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:14:14,937][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:14:15,596][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:14:16,254][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:14:16,913][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:14:17,573][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:14:18,233][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:14:18,891][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:14:19,550][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:14:20,210][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:14:20,870][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:14:21,530][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:14:22,189][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:14:22,848][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:14:23,506][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:14:24,165][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:14:24,824][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:14:25,483][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:14:26,142][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:14:26,801][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:14:27,461][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:14:28,562][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:14:29,220][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:14:29,878][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:14:30,537][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:14:31,196][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:14:31,854][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:14:32,512][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:14:33,170][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:14:33,828][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:14:34,486][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:14:35,471][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:14:36,132][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:14:36,790][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:14:37,450][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:14:38,109][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:14:38,766][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:14:39,426][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:14:40,085][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:14:40,744][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:14:41,403][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:14:42,062][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:14:42,720][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:14:43,378][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:14:44,039][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:14:44,697][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:14:45,356][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:14:46,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:14:46,787][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:45 [2026-03-25 19:14:48,168][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:14:48,171][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:14:48,172][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:14:49,560][__main__][INFO] - Iteration 333 took 56s (10.00% Gen, 87.54% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 31m 55s. Estimated total time: 15h 40m 39s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 3s, 500 more iterations: 7h 50m 19s. [2026-03-25 19:14:49,562][__main__][INFO] - Starting iteration 333. [2026-03-25 19:14:49,566][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:14:49,566][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:14:55,073][__main__][INFO] - Number of regex retries in iteration 333: 0 [2026-03-25 19:14:55,074][__main__][INFO] - agents played in iteration 333 are Bob, Alice [2026-03-25 19:14:55,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:14:55,785][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:14:55,786][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:14:55,786][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:14:56,543][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:14:57,151][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:14:57,813][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:14:58,473][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:14:59,135][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:14:59,795][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:15:00,455][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:15:01,118][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:15:01,783][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:15:02,441][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:15:03,102][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:15:03,763][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:15:04,424][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:15:05,087][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:15:05,746][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:15:06,408][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:15:07,070][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:15:07,727][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:15:08,387][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:15:09,048][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:15:09,708][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:15:10,367][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:15:11,025][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:15:11,686][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:15:12,346][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:15:13,005][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:15:13,666][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:15:14,325][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:15:14,985][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:15:15,644][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:15:16,304][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:15:16,963][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:15:17,623][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:15:18,282][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:15:18,941][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:15:19,602][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:15:20,261][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:15:20,921][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:15:21,580][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:15:22,239][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:15:22,898][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:15:23,559][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:15:24,218][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:15:24,877][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:15:25,537][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:15:26,196][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:15:26,856][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:15:27,515][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:15:28,493][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:15:29,154][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:15:29,812][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:15:30,471][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:15:31,130][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:15:31,788][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:15:32,446][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:15:33,107][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:15:33,766][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:15:34,424][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:15:35,083][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:15:35,741][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:15:36,400][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:15:37,058][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:15:37,717][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:15:38,376][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:15:39,034][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:15:39,808][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:15:41,455][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:15:41,458][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:15:41,468][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:15:43,003][__main__][INFO] - Iteration 334 took 53s (10.31% Gen, 86.82% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 41m 1s. Estimated total time: 14h 50m 38s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 3s, 500 more iterations: 7h 25m 19s. [2026-03-25 19:15:43,005][__main__][INFO] - Starting iteration 334. [2026-03-25 19:15:43,009][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:15:43,010][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:15:48,105][__main__][INFO] - Number of regex retries in iteration 334: 0 [2026-03-25 19:15:48,106][__main__][INFO] - agents played in iteration 334 are Bob, Alice [2026-03-25 19:15:48,629][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:15:48,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:15:48,692][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:15:48,692][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:15:49,517][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:15:50,140][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:15:50,801][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:15:51,461][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:15:52,120][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:15:52,779][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:15:53,440][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:15:54,099][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:15:54,758][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:15:55,417][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:15:56,077][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:15:56,736][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:15:57,395][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:15:58,054][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:15:58,715][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:15:59,374][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:16:00,033][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:16:00,692][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:16:01,352][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:16:02,012][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:16:02,671][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:16:03,330][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:16:03,990][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:16:04,648][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:16:05,307][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:16:05,966][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:16:06,625][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:16:07,284][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:16:07,943][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:16:08,602][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:16:09,261][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:16:09,920][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:16:10,581][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:16:11,240][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:16:11,899][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:16:12,559][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:16:13,219][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:16:13,879][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:16:14,537][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:16:15,196][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:16:15,856][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:16:16,516][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:16:17,174][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:16:17,833][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:16:18,492][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:16:19,151][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:16:19,815][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:16:20,472][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:16:21,475][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:16:22,135][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:16:22,793][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:16:23,452][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:16:24,110][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:16:24,768][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:16:25,426][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:16:26,086][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:16:26,744][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:16:27,403][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:16:28,061][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:16:28,721][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:16:29,379][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:16:30,039][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:16:30,697][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:16:31,356][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:16:32,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:16:32,805][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:16:34,197][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:16:34,200][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:16:34,201][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:16:37,376][__main__][INFO] - Iteration 335 took 54s (9.37% Gen, 84.78% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 55m 37s. Estimated total time: 15h 6m 9s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 36s, 500 more iterations: 7h 33m 4s. [2026-03-25 19:16:37,379][__main__][INFO] - Starting iteration 335. [2026-03-25 19:16:37,383][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:16:37,383][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:16:42,507][__main__][INFO] - Number of regex retries in iteration 335: 0 [2026-03-25 19:16:42,508][__main__][INFO] - agents played in iteration 335 are Bob, Alice [2026-03-25 19:16:43,516][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:16:43,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:16:43,576][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:16:43,577][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:16:44,220][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:16:44,824][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:16:45,484][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:16:46,143][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:16:46,802][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:16:47,460][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:16:48,120][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:16:48,778][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:16:49,437][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:16:50,097][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:16:50,758][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:16:51,418][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:16:52,079][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:16:52,737][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:16:53,397][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:16:54,056][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:16:54,716][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:16:55,375][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:16:56,873][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:16:57,532][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:16:58,190][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:16:58,848][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:16:59,506][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:17:00,163][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:17:00,821][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:17:01,479][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:17:02,137][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:17:02,795][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:17:03,454][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:17:04,111][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:17:04,770][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:17:05,430][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:17:06,087][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:17:06,745][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:17:07,403][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:17:08,060][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:17:08,718][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:17:09,377][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:17:10,036][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:17:10,694][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:17:11,351][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:17:12,010][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:17:12,667][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:17:13,325][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:17:13,983][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:17:14,642][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:17:15,300][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:17:15,958][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:17:16,937][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:17:17,597][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:17:18,256][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:17:18,915][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:17:19,573][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:17:20,232][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:17:20,890][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:17:21,548][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:17:22,207][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:17:22,866][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:17:23,525][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:17:24,186][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:17:24,848][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:17:25,505][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:17:26,163][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:17:26,822][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:17:27,481][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:17:28,296][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 19:17:29,413][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:17:29,416][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:17:29,418][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:17:30,730][__main__][INFO] - Iteration 336 took 53s (9.60% Gen, 87.93% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 37m 43s. Estimated total time: 14h 49m 9s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 54s, 500 more iterations: 7h 24m 34s. [2026-03-25 19:17:30,732][__main__][INFO] - Starting iteration 336. [2026-03-25 19:17:30,740][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:17:30,740][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:17:37,673][__main__][INFO] - Number of regex retries in iteration 336: 0 [2026-03-25 19:17:37,674][__main__][INFO] - agents played in iteration 336 are Bob, Alice [2026-03-25 19:17:38,219][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:17:38,280][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:17:38,281][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:17:38,282][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:17:38,932][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:17:39,537][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:17:40,198][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:17:40,858][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:17:41,517][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:17:42,179][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:17:42,839][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:17:43,498][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:17:44,159][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:17:44,817][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:17:45,477][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:17:46,138][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:17:46,800][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:17:47,460][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:17:48,120][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:17:48,779][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:17:49,440][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:17:50,102][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:17:50,761][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:17:51,420][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:17:52,079][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:17:52,738][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:17:53,400][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:17:54,059][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:17:54,719][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:17:55,380][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:17:56,040][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:17:56,700][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:17:57,360][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:17:58,020][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:17:58,681][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:17:59,341][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:18:00,002][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:18:00,665][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:18:01,324][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:18:01,983][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:18:02,643][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:18:03,302][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:18:03,962][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:18:04,621][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:18:05,281][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:18:05,943][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:18:06,601][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:18:07,260][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:18:07,923][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:18:08,583][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:18:09,240][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:18:09,899][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:18:10,876][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:18:11,536][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:18:12,197][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:18:12,853][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:18:13,512][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:18:14,172][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:18:14,830][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:18:15,489][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:18:16,148][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:18:16,807][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:18:17,466][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:18:18,126][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:18:18,786][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:18:19,446][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:18:20,106][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:18:20,768][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:18:21,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:18:22,218][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:18:23,622][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:18:23,625][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:18:23,626][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:18:26,955][__main__][INFO] - Iteration 337 took 56s (12.33% Gen, 81.74% Train). Generation: 6s, Training: 45s. Estimated remaining time: 10h 24m 34s. Estimated total time: 15h 36m 56s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 41s, 500 more iterations: 7h 48m 28s. [2026-03-25 19:18:26,957][__main__][INFO] - Starting iteration 337. [2026-03-25 19:18:26,961][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:18:26,961][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:18:32,115][__main__][INFO] - Number of regex retries in iteration 337: 0 [2026-03-25 19:18:32,116][__main__][INFO] - agents played in iteration 337 are Bob, Alice [2026-03-25 19:18:32,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:18:32,731][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:18:32,732][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:18:32,732][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:18:33,395][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:18:34,008][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:18:34,668][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:18:35,328][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:18:35,989][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:18:36,646][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:18:37,308][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:18:37,969][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:18:38,630][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:18:39,290][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:18:39,949][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:18:40,609][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:18:41,270][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:18:41,930][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:18:42,589][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:18:43,250][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:18:43,908][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:18:44,568][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:18:46,399][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:18:47,061][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:18:47,722][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:18:48,386][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:18:49,051][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:18:49,713][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:18:50,376][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:18:51,040][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:18:51,703][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:18:52,366][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:18:53,030][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:18:53,692][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:18:54,356][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:18:55,020][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:18:55,682][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:18:56,341][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:18:57,001][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:18:57,659][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:18:58,319][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:18:58,980][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:18:59,638][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:19:00,299][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:19:00,957][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:19:01,618][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:19:02,278][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:19:02,938][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:19:03,598][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:19:04,258][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:19:04,918][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:19:05,578][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:19:06,556][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:19:07,215][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:19:07,874][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:19:08,534][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:19:09,193][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:19:09,857][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:19:10,516][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:19:11,175][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:19:11,834][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:19:12,495][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:19:13,154][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:19:13,814][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:19:14,475][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:19:15,136][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:19:15,795][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:19:16,455][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:19:17,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:19:17,905][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 19:19:19,306][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:19:19,309][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:19:19,310][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:19:20,731][__main__][INFO] - Iteration 338 took 53s (9.59% Gen, 87.77% Train). Generation: 5s, Training: 47s. Estimated remaining time: 9h 42m 56s. Estimated total time: 14h 56m 11s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 37s, 500 more iterations: 7h 28m 5s. [2026-03-25 19:19:20,733][__main__][INFO] - Starting iteration 338. [2026-03-25 19:19:20,738][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:19:20,738][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:19:26,113][__main__][INFO] - Number of regex retries in iteration 338: 0 [2026-03-25 19:19:26,113][__main__][INFO] - agents played in iteration 338 are Bob, Alice [2026-03-25 19:19:26,741][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:19:26,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:19:26,813][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:19:26,814][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:19:27,518][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:19:28,147][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:19:28,807][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:19:29,467][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:19:30,126][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:19:30,787][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:19:31,447][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:19:32,106][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:19:32,765][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:19:33,426][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:19:34,084][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:19:34,745][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:19:35,404][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:19:36,065][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:19:36,725][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:19:37,386][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:19:38,047][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:19:39,550][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:19:40,206][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:19:40,866][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:19:41,525][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:19:42,184][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:19:42,843][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:19:43,501][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:19:44,160][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:19:44,818][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:19:45,478][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:19:46,138][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:19:46,798][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:19:47,458][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:19:48,116][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:19:48,775][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:19:49,435][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:19:50,095][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:19:50,753][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:19:51,413][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:19:52,071][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:19:52,732][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:19:53,392][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:19:54,052][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:19:54,711][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:19:55,370][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:19:56,029][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:19:56,689][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:19:57,346][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:19:58,006][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:19:58,666][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:19:59,325][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:20:00,322][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:20:00,983][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:20:01,646][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:20:02,304][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:20:02,962][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:20:03,621][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:20:04,281][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:20:04,940][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:20:05,599][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:20:06,258][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:20:06,919][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:20:07,579][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:20:08,242][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:20:08,904][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:20:09,565][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:20:10,223][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:20:10,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:20:11,665][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 19:20:13,159][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:20:13,162][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:20:13,173][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:20:14,621][__main__][INFO] - Iteration 339 took 53s (9.98% Gen, 87.33% Train). Generation: 5s, Training: 47s. Estimated remaining time: 9h 43m 56s. Estimated total time: 14h 58m 5s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 48s, 500 more iterations: 7h 29m 2s. [2026-03-25 19:20:14,623][__main__][INFO] - Starting iteration 339. [2026-03-25 19:20:14,627][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:20:14,628][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:20:22,297][__main__][INFO] - Number of regex retries in iteration 339: 0 [2026-03-25 19:20:22,298][__main__][INFO] - agents played in iteration 339 are Bob, Alice [2026-03-25 19:20:23,105][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:20:23,166][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:20:23,167][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:20:23,167][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:20:23,927][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:20:24,534][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:20:25,195][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:20:25,854][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:20:26,515][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:20:27,174][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:20:27,833][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:20:28,493][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:20:29,155][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:20:29,815][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:20:30,474][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:20:31,133][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:20:31,792][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:20:32,453][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:20:33,112][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:20:33,772][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:20:34,431][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:20:35,092][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:20:35,750][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:20:36,409][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:20:37,068][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:20:37,727][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:20:38,386][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:20:39,044][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:20:39,704][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:20:40,364][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:20:41,023][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:20:41,682][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:20:42,341][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:20:43,000][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:20:43,659][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:20:44,318][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:20:44,977][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:20:45,636][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:20:46,296][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:20:46,957][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:20:47,614][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:20:48,274][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:20:48,933][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:20:49,593][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:20:50,252][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:20:50,911][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:20:51,571][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:20:52,230][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:20:52,890][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:20:53,549][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:20:54,208][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:20:54,868][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:20:55,851][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:20:56,511][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:20:57,171][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:20:57,829][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:20:58,487][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:20:59,146][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:20:59,805][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:21:00,464][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:21:01,123][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:21:01,783][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:21:02,443][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:21:03,102][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:21:03,760][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:21:04,420][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:21:05,079][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:21:05,739][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:21:07,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:21:08,093][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 19:21:09,514][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:21:09,517][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:21:09,518][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:21:10,936][__main__][INFO] - Iteration 340 took 56s (13.62% Gen, 83.86% Train). Generation: 7s, Training: 47s. Estimated remaining time: 10h 23m 25s. Estimated total time: 15h 38m 30s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 51s, 500 more iterations: 7h 49m 15s. [2026-03-25 19:21:10,938][__main__][INFO] - Starting iteration 340. [2026-03-25 19:21:10,944][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:21:10,945][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:21:16,563][__main__][INFO] - Number of regex retries in iteration 340: 0 [2026-03-25 19:21:16,565][__main__][INFO] - agents played in iteration 340 are Bob, Alice [2026-03-25 19:21:17,147][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:21:17,208][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:21:17,209][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:21:17,210][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:21:17,872][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:21:18,479][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:21:19,142][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:21:19,802][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:21:20,463][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:21:21,121][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:21:21,781][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:21:22,441][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:21:23,101][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:21:23,760][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:21:24,418][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:21:25,077][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:21:25,736][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:21:26,395][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:21:27,054][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:21:27,715][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:21:28,375][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:21:29,035][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:21:29,694][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:21:30,353][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:21:31,014][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:21:31,674][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:21:32,335][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:21:32,995][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:21:33,655][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:21:34,315][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:21:34,979][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:21:35,639][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:21:36,300][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:21:36,961][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:21:37,621][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:21:38,280][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:21:38,944][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:21:39,603][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:21:40,263][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:21:40,923][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:21:41,583][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:21:42,243][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:21:42,905][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:21:43,564][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:21:44,225][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:21:44,885][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:21:45,544][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:21:46,207][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:21:46,866][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:21:47,526][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:21:48,185][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:21:48,847][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:21:49,840][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:21:50,498][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:21:51,155][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:21:51,815][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:21:52,472][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:21:53,133][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:21:53,789][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:21:54,448][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:21:55,107][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:21:55,766][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:21:56,427][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:21:57,086][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:21:57,744][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:21:58,405][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:21:59,065][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:21:59,724][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:22:00,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:22:01,254][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:22:02,334][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:22:02,337][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:22:02,365][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:22:03,782][__main__][INFO] - Iteration 341 took 52s (10.64% Gen, 86.68% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 24m 42s. Estimated total time: 14h 40m 41s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 4s, 500 more iterations: 7h 20m 20s. [2026-03-25 19:22:03,785][__main__][INFO] - Starting iteration 341. [2026-03-25 19:22:03,789][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:22:03,789][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:22:08,927][__main__][INFO] - Number of regex retries in iteration 341: 0 [2026-03-25 19:22:08,928][__main__][INFO] - agents played in iteration 341 are Bob, Alice [2026-03-25 19:22:09,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:22:09,514][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:22:09,515][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:22:09,515][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:22:10,185][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:22:10,801][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:22:11,461][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:22:12,122][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:22:12,783][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:22:13,444][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:22:14,104][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:22:14,764][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:22:15,426][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:22:16,087][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:22:16,746][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:22:17,408][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:22:18,070][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:22:18,730][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:22:19,392][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:22:20,055][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:22:20,716][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:22:21,377][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:22:22,037][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:22:22,698][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:22:23,358][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:22:24,019][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:22:24,679][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:22:25,341][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:22:26,002][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:22:26,662][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:22:27,322][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:22:27,982][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:22:28,643][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:22:29,304][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:22:29,964][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:22:30,624][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:22:31,284][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:22:31,944][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:22:32,605][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:22:33,265][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:22:33,926][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:22:34,586][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:22:35,248][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:22:35,909][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:22:36,568][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:22:37,228][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:22:37,888][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:22:38,548][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:22:39,206][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:22:39,864][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:22:40,524][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:22:41,183][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:22:42,164][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:22:42,824][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:22:43,484][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:22:44,142][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:22:44,800][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:22:45,459][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:22:46,117][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:22:46,776][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:22:47,435][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:22:48,094][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:22:48,754][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:22:49,415][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:22:50,074][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:22:50,733][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:22:51,391][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:22:52,051][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:22:52,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:22:53,515][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:22:54,886][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:22:54,889][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:22:54,890][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:22:56,278][__main__][INFO] - Iteration 342 took 52s (9.79% Gen, 87.56% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 17m 59s. Estimated total time: 14h 34m 50s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 29s, 500 more iterations: 7h 17m 25s. [2026-03-25 19:22:56,280][__main__][INFO] - Starting iteration 342. [2026-03-25 19:22:56,296][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:22:56,296][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:23:01,269][__main__][INFO] - Number of regex retries in iteration 342: 0 [2026-03-25 19:23:01,269][__main__][INFO] - agents played in iteration 342 are Bob, Alice [2026-03-25 19:23:01,751][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:23:01,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:23:01,815][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:23:01,816][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:23:02,577][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:23:03,199][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:23:03,858][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:23:04,518][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:23:05,178][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:23:05,838][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:23:06,499][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:23:07,158][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:23:07,819][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:23:08,478][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:23:09,137][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:23:09,796][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:23:10,456][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:23:11,116][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:23:11,776][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:23:12,436][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:23:13,097][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:23:13,757][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:23:14,416][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:23:15,076][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:23:15,736][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:23:16,396][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:23:17,056][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:23:17,716][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:23:18,376][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:23:19,037][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:23:19,696][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:23:20,355][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:23:21,014][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:23:21,673][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:23:22,332][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:23:22,990][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:23:23,649][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:23:24,307][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:23:24,966][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:23:25,627][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:23:26,285][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:23:26,945][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:23:27,605][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:23:28,265][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:23:28,925][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:23:29,586][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:23:30,247][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:23:30,907][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:23:31,566][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:23:32,224][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:23:32,884][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:23:33,544][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:23:34,543][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:23:35,204][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:23:35,863][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:23:36,521][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:23:37,181][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:23:37,842][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:23:38,502][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:23:39,161][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:23:39,821][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:23:40,481][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:23:41,141][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:23:41,800][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:23:42,461][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:23:43,120][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:23:43,780][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:23:44,440][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:23:45,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:23:45,886][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:23:47,289][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:23:47,292][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:23:47,293][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:23:48,825][__main__][INFO] - Iteration 343 took 52s (9.47% Gen, 87.61% Train). Generation: 4s, Training: 46s. Estimated remaining time: 9h 17m 47s. Estimated total time: 14h 35m 30s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 33s, 500 more iterations: 7h 17m 45s. [2026-03-25 19:23:48,827][__main__][INFO] - Starting iteration 343. [2026-03-25 19:23:48,831][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:23:48,832][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:23:55,317][__main__][INFO] - Number of regex retries in iteration 343: 0 [2026-03-25 19:23:55,318][__main__][INFO] - agents played in iteration 343 are Bob, Alice [2026-03-25 19:23:56,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:23:56,444][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:23:56,445][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:23:56,445][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:23:57,239][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:23:57,849][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:23:58,515][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:23:59,179][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:23:59,840][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:24:00,500][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:24:01,160][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:24:01,820][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:24:02,483][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:24:03,142][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:24:03,803][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:24:04,463][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:24:05,124][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:24:05,784][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:24:06,443][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:24:07,104][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:24:07,764][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:24:08,423][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:24:09,082][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:24:09,742][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:24:10,403][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:24:11,063][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:24:11,725][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:24:12,385][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:24:13,044][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:24:13,703][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:24:14,362][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:24:15,022][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:24:15,682][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:24:16,342][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:24:17,003][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:24:17,663][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:24:18,323][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:24:18,982][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:24:19,646][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:24:20,306][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:24:20,965][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:24:21,625][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:24:22,286][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:24:22,947][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:24:23,607][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:24:24,267][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:24:24,928][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:24:25,590][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:24:26,251][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:24:26,912][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:24:27,572][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:24:28,233][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:24:29,243][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:24:29,903][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:24:30,564][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:24:31,223][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:24:31,883][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:24:32,545][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:24:33,204][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:24:33,863][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:24:34,522][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:24:35,180][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:24:35,839][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:24:36,498][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:24:37,158][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:24:37,817][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:24:38,476][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:24:39,134][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:24:39,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:24:40,558][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:24:42,102][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:24:42,105][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:24:42,117][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:24:43,488][__main__][INFO] - Iteration 344 took 54s (11.87% Gen, 85.62% Train). Generation: 6s, Training: 46s. Estimated remaining time: 9h 52m 20s. Estimated total time: 15h 10m 58s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 5s, 500 more iterations: 7h 35m 29s. [2026-03-25 19:24:43,490][__main__][INFO] - Starting iteration 344. [2026-03-25 19:24:43,494][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:24:43,494][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:24:48,495][__main__][INFO] - Number of regex retries in iteration 344: 0 [2026-03-25 19:24:48,496][__main__][INFO] - agents played in iteration 344 are Bob, Alice [2026-03-25 19:24:49,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:24:49,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:24:49,072][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:24:49,073][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:24:49,805][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:24:50,424][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:24:51,085][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:24:51,745][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:24:52,405][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:24:53,064][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:24:53,723][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:24:54,383][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:24:55,042][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:24:55,704][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:24:56,362][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:24:57,023][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:24:57,683][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:24:58,343][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:24:59,004][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:24:59,664][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:25:00,324][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:25:00,985][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:25:01,645][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:25:02,304][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:25:02,964][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:25:03,623][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:25:04,282][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:25:04,941][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:25:05,601][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:25:06,261][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:25:06,922][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:25:07,582][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:25:08,244][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:25:08,905][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:25:09,565][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:25:10,226][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:25:10,886][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:25:11,547][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:25:12,207][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:25:12,866][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:25:13,526][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:25:14,186][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:25:14,847][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:25:15,506][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:25:16,165][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:25:16,825][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:25:17,485][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:25:18,144][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:25:18,804][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:25:19,463][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:25:20,123][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:25:20,782][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:25:21,772][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:25:22,432][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:25:23,090][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:25:23,749][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:25:24,409][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:25:25,068][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:25:25,727][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:25:26,385][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:25:27,045][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:25:27,704][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:25:28,362][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:25:29,022][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:25:29,682][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:25:30,341][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:25:31,000][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:25:31,659][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:25:32,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:25:33,051][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:25:34,471][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:25:34,474][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:25:34,475][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:25:36,009][__main__][INFO] - Iteration 345 took 52s (9.52% Gen, 87.55% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 15m 46s. Estimated total time: 14h 35m 17s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 31s, 500 more iterations: 7h 17m 38s. [2026-03-25 19:25:36,011][__main__][INFO] - Starting iteration 345. [2026-03-25 19:25:36,014][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:25:36,015][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:25:41,282][__main__][INFO] - Number of regex retries in iteration 345: 0 [2026-03-25 19:25:41,283][__main__][INFO] - agents played in iteration 345 are Bob, Alice [2026-03-25 19:25:41,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:25:41,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:25:41,836][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:25:41,837][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:25:42,625][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:25:43,234][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:25:43,900][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:25:44,559][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:25:45,217][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:25:45,878][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:25:46,538][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:25:47,197][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:25:47,856][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:25:48,517][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:25:49,177][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:25:49,837][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:25:50,496][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:25:51,155][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:25:51,815][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:25:52,474][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:25:53,133][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:25:53,792][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:25:54,451][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:25:55,110][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:25:55,770][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:25:56,429][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:25:57,087][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:25:57,748][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:25:58,408][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:25:59,068][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:25:59,727][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:26:00,387][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:26:01,048][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:26:01,708][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:26:02,367][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:26:03,026][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:26:03,685][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:26:04,343][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:26:05,003][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:26:05,663][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:26:06,323][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:26:06,982][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:26:07,641][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:26:08,300][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:26:08,959][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:26:09,618][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:26:10,277][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:26:10,936][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:26:11,594][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:26:12,255][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:26:12,914][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:26:13,573][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:26:14,560][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:26:15,219][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:26:15,879][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:26:16,538][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:26:17,197][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:26:17,855][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:26:18,515][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:26:19,175][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:26:19,833][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:26:20,490][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:26:21,147][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:26:21,804][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:26:22,464][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:26:23,122][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:26:23,780][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:26:24,438][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:26:25,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:26:25,843][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:26:27,223][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:26:27,226][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:26:27,227][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:26:28,657][__main__][INFO] - Iteration 346 took 52s (10.01% Gen, 87.27% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 17m 0s. Estimated total time: 14h 37m 24s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 44s, 500 more iterations: 7h 18m 42s. [2026-03-25 19:26:28,662][__main__][INFO] - Starting iteration 346. [2026-03-25 19:26:28,668][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:26:28,669][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:26:34,728][__main__][INFO] - Number of regex retries in iteration 346: 0 [2026-03-25 19:26:34,728][__main__][INFO] - agents played in iteration 346 are Bob, Alice [2026-03-25 19:26:35,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:26:35,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:26:35,711][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:26:35,711][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:26:36,397][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:26:37,003][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:26:37,663][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:26:38,324][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:26:38,983][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:26:39,643][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:26:40,303][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:26:40,963][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:26:41,623][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:26:42,282][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:26:42,943][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:26:43,603][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:26:44,262][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:26:44,922][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:26:45,581][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:26:46,240][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:26:46,899][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:26:47,558][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:26:48,218][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:26:48,879][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:26:49,541][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:26:50,201][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:26:50,859][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:26:51,519][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:26:52,183][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:26:52,841][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:26:53,502][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:26:54,164][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:26:54,824][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:26:55,485][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:26:56,146][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:26:56,805][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:26:57,465][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:26:58,125][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:26:58,785][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:26:59,445][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:27:00,106][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:27:00,767][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:27:01,426][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:27:02,086][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:27:02,746][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:27:03,406][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:27:04,066][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:27:04,726][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:27:05,387][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:27:06,045][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:27:06,706][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:27:07,367][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:27:08,361][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:27:09,021][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:27:09,681][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:27:10,339][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:27:10,998][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:27:11,657][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:27:12,315][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:27:12,975][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:27:13,635][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:27:14,295][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:27:14,955][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:27:15,614][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:27:16,272][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:27:16,932][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:27:17,591][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:27:18,249][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:27:18,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:27:19,698][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:27:21,086][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:27:21,089][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:27:21,090][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:27:22,402][__main__][INFO] - Iteration 347 took 53s (11.28% Gen, 86.28% Train). Generation: 6s, Training: 46s. Estimated remaining time: 9h 34m 19s. Estimated total time: 14h 55m 36s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 33s, 500 more iterations: 7h 27m 48s. [2026-03-25 19:27:22,405][__main__][INFO] - Starting iteration 347. [2026-03-25 19:27:22,408][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:27:22,408][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:27:28,835][__main__][INFO] - Number of regex retries in iteration 347: 0 [2026-03-25 19:27:28,837][__main__][INFO] - agents played in iteration 347 are Bob, Alice [2026-03-25 19:27:29,464][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:27:29,530][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:27:29,531][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:27:29,532][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:27:30,369][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:27:30,996][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:27:31,657][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:27:32,317][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:27:32,977][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:27:33,638][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:27:34,299][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:27:34,964][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:27:35,626][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:27:36,287][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:27:36,948][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:27:37,609][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:27:38,269][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:27:38,930][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:27:39,592][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:27:40,252][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:27:40,911][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:27:41,570][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:27:42,228][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:27:42,889][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:27:43,549][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:27:44,210][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:27:44,869][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:27:45,528][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:27:46,188][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:27:46,846][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:27:47,505][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:27:48,165][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:27:48,825][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:27:49,486][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:27:50,144][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:27:50,804][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:27:51,464][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:27:52,124][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:27:52,783][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:27:53,442][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:27:54,100][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:27:54,759][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:27:55,418][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:27:56,076][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:27:56,735][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:27:57,394][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:27:58,053][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:27:58,713][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:27:59,373][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:28:00,032][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:28:00,691][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:28:01,352][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:28:02,337][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:28:02,996][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:28:03,654][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:28:04,311][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:28:04,969][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:28:05,627][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:28:06,284][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:28:06,942][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:28:07,601][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:28:08,259][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:28:08,918][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:28:09,577][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:28:10,235][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:28:10,894][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:28:11,554][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:28:12,212][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:28:12,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:28:13,614][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:28:15,002][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:28:15,005][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:28:15,006][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:28:16,379][__main__][INFO] - Iteration 348 took 53s (11.91% Gen, 85.54% Train). Generation: 6s, Training: 46s. Estimated remaining time: 9h 37m 21s. Estimated total time: 14h 59m 32s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 57s, 500 more iterations: 7h 29m 46s. [2026-03-25 19:28:16,381][__main__][INFO] - Starting iteration 348. [2026-03-25 19:28:16,386][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:28:16,386][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:28:21,456][__main__][INFO] - Number of regex retries in iteration 348: 0 [2026-03-25 19:28:21,457][__main__][INFO] - agents played in iteration 348 are Bob, Alice [2026-03-25 19:28:21,940][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:28:22,004][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:28:22,004][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:28:22,005][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:28:22,748][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:28:23,371][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:28:24,031][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:28:24,691][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:28:25,351][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:28:26,010][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:28:26,669][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:28:27,329][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:28:27,988][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:28:28,648][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:28:29,307][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:28:29,966][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:28:30,625][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:28:31,283][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:28:31,942][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:28:32,601][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:28:33,261][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:28:33,920][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:28:34,580][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:28:35,238][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:28:35,897][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:28:36,556][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:28:37,214][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:28:37,874][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:28:38,534][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:28:39,194][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:28:39,851][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:28:40,509][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:28:41,168][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:28:41,828][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:28:42,486][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:28:43,145][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:28:43,806][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:28:44,466][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:28:45,126][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:28:45,785][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:28:46,446][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:28:47,105][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:28:47,765][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:28:48,426][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:28:49,085][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:28:49,746][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:28:50,405][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:28:51,065][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:28:51,725][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:28:52,385][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:28:53,046][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:28:53,706][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:28:54,702][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:28:55,363][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:28:56,020][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:28:56,679][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:28:57,338][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:28:57,998][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:28:58,658][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:28:59,320][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:28:59,980][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:29:00,640][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:29:01,299][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:29:01,958][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:29:02,618][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:29:03,278][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:29:03,938][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:29:04,597][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:29:05,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:29:06,196][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:29:07,658][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:29:07,661][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:29:07,662][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:29:09,087][__main__][INFO] - Iteration 349 took 52s (9.62% Gen, 87.67% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 15m 19s. Estimated total time: 14h 38m 23s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 50s, 500 more iterations: 7h 19m 11s. [2026-03-25 19:29:09,090][__main__][INFO] - Starting iteration 349. [2026-03-25 19:29:09,094][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:29:09,095][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:29:13,988][__main__][INFO] - Number of regex retries in iteration 349: 0 [2026-03-25 19:29:13,989][__main__][INFO] - agents played in iteration 349 are Bob, Alice [2026-03-25 19:29:14,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:29:14,536][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:29:14,537][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:29:14,538][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:29:15,302][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:29:15,917][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:29:16,577][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:29:17,237][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:29:17,895][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:29:18,556][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:29:19,216][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:29:19,876][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:29:20,536][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:29:21,197][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:29:21,857][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:29:22,518][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:29:23,177][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:29:23,837][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:29:24,496][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:29:25,156][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:29:25,815][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:29:26,475][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:29:27,134][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:29:27,792][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:29:28,452][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:29:29,112][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:29:29,772][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:29:30,432][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:29:31,092][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:29:31,751][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:29:32,409][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:29:33,071][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:29:33,729][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:29:34,387][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:29:35,046][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:29:35,705][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:29:36,364][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:29:37,025][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:29:37,684][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:29:38,342][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:29:39,002][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:29:39,662][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:29:40,322][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:29:40,982][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:29:41,642][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:29:42,302][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:29:42,963][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:29:43,623][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:29:44,282][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:29:44,942][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:29:45,603][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:29:46,263][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:29:47,243][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:29:47,902][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:29:48,561][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:29:49,219][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:29:49,878][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:29:50,538][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:29:51,198][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:29:51,856][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:29:52,515][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:29:53,173][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:29:53,830][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:29:54,488][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:29:55,146][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:29:55,804][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:29:56,462][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:29:57,121][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:29:57,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:29:58,597][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:30:00,384][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:30:00,386][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:30:00,387][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:30:01,756][__main__][INFO] - Iteration 350 took 52s (9.29% Gen, 88.10% Train). Generation: 4s, Training: 46s. Estimated remaining time: 9h 13m 47s. Estimated total time: 14h 37m 43s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 46s, 500 more iterations: 7h 18m 51s. [2026-03-25 19:30:01,758][__main__][INFO] - Starting iteration 350. [2026-03-25 19:30:01,763][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:30:01,764][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:30:07,128][__main__][INFO] - Number of regex retries in iteration 350: 0 [2026-03-25 19:30:07,129][__main__][INFO] - agents played in iteration 350 are Bob, Alice [2026-03-25 19:30:07,730][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:30:07,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:30:07,791][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:30:07,792][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:30:08,578][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:30:09,201][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:30:09,863][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:30:10,523][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:30:11,184][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:30:11,844][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:30:12,502][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:30:13,162][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:30:13,820][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:30:14,480][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:30:15,139][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:30:15,797][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:30:16,455][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:30:17,114][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:30:17,773][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:30:18,435][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:30:19,095][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:30:19,754][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:30:20,413][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:30:21,073][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:30:21,733][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:30:22,391][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:30:23,050][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:30:23,709][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:30:24,368][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:30:25,027][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:30:25,685][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:30:26,344][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:30:27,004][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:30:27,663][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:30:28,321][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:30:28,996][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:30:29,653][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:30:30,312][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:30:30,971][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:30:31,630][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:30:32,288][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:30:32,947][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:30:33,606][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:30:34,264][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:30:34,923][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:30:35,582][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:30:36,242][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:30:36,900][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:30:37,559][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:30:38,217][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:30:38,876][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:30:39,536][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:30:40,523][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:30:41,181][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:30:41,840][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:30:42,498][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:30:43,158][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:30:43,815][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:30:44,474][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:30:45,132][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:30:45,789][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:30:46,447][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:30:47,104][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:30:47,762][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:30:48,421][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:30:49,083][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:30:49,741][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:30:50,400][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:30:51,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:30:51,848][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:30:53,519][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:30:53,522][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:30:53,523][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:30:56,323][__main__][INFO] - Iteration 351 took 54s (9.83% Gen, 85.03% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 44m 31s. Estimated total time: 15h 9m 22s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 56s, 500 more iterations: 7h 34m 41s. [2026-03-25 19:30:56,325][__main__][INFO] - Starting iteration 351. [2026-03-25 19:30:56,329][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:30:56,330][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:31:01,969][__main__][INFO] - Number of regex retries in iteration 351: 0 [2026-03-25 19:31:01,971][__main__][INFO] - agents played in iteration 351 are Bob, Alice [2026-03-25 19:31:02,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:31:02,948][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:31:02,949][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:31:02,949][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:31:03,668][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:31:04,295][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:31:04,945][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:31:05,604][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:31:06,264][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:31:06,924][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:31:07,584][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:31:08,242][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:31:08,901][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:31:09,560][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:31:10,221][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:31:10,880][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:31:11,540][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:31:12,199][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:31:12,859][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:31:13,518][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:31:14,178][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:31:14,836][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:31:15,495][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:31:16,155][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:31:16,813][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:31:17,471][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:31:18,131][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:31:18,789][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:31:19,448][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:31:20,106][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:31:20,766][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:31:21,426][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:31:22,085][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:31:22,743][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:31:23,402][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:31:24,060][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:31:24,718][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:31:25,376][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:31:26,035][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:31:26,693][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:31:27,352][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:31:28,011][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:31:28,670][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:31:29,329][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:31:29,987][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:31:30,647][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:31:31,306][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:31:31,965][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:31:32,624][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:31:33,282][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:31:33,941][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:31:34,599][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:31:35,585][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:31:36,244][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:31:36,903][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:31:37,561][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:31:38,219][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:31:38,877][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:31:39,535][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:31:40,193][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:31:40,853][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:31:41,513][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:31:42,172][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:31:42,831][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:31:43,490][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:31:44,149][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:31:44,809][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:31:45,468][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:31:46,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:31:46,888][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:31:48,253][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:31:48,256][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:31:48,258][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:31:49,658][__main__][INFO] - Iteration 352 took 53s (10.58% Gen, 86.79% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 23m 6s. Estimated total time: 14h 48m 50s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 53s, 500 more iterations: 7h 24m 25s. [2026-03-25 19:31:49,660][__main__][INFO] - Starting iteration 352. [2026-03-25 19:31:49,664][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:31:49,665][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:31:54,816][__main__][INFO] - Number of regex retries in iteration 352: 0 [2026-03-25 19:31:54,817][__main__][INFO] - agents played in iteration 352 are Bob, Alice [2026-03-25 19:31:55,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:31:55,359][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:31:55,359][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:31:55,360][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:31:56,222][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:31:56,844][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:31:57,505][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:31:58,163][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:31:58,824][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:31:59,483][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:32:00,144][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:32:00,803][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:32:01,463][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:32:02,123][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:32:02,782][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:32:03,443][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:32:04,102][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:32:04,761][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:32:05,420][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:32:06,078][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:32:06,736][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:32:07,395][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:32:08,053][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:32:08,713][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:32:09,372][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:32:10,031][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:32:10,689][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:32:11,348][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:32:12,006][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:32:12,666][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:32:13,325][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:32:13,984][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:32:14,643][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:32:15,320][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:32:16,116][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:32:16,818][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:32:17,479][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:32:18,136][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:32:18,796][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:32:19,456][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:32:20,117][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:32:20,776][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:32:21,435][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:32:22,097][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:32:22,756][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:32:23,416][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:32:24,076][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:32:24,736][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:32:25,396][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:32:26,055][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:32:26,717][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:32:27,376][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:32:28,424][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:32:29,085][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:32:29,743][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:32:30,400][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:32:31,059][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:32:31,717][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:32:32,375][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:32:33,033][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:32:33,691][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:32:34,351][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:32:35,009][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:32:35,666][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:32:36,325][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:32:36,983][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:32:37,640][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:32:38,299][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:32:38,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:32:39,874][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:32:41,327][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:32:41,330][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:32:41,331][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:32:42,754][__main__][INFO] - Iteration 353 took 53s (9.70% Gen, 87.61% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 18m 14s. Estimated total time: 14h 44m 51s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 29s, 500 more iterations: 7h 22m 25s. [2026-03-25 19:32:42,764][__main__][INFO] - Starting iteration 353. [2026-03-25 19:32:42,797][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:32:42,798][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:32:48,214][__main__][INFO] - Number of regex retries in iteration 353: 0 [2026-03-25 19:32:48,215][__main__][INFO] - agents played in iteration 353 are Bob, Alice [2026-03-25 19:32:48,758][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:32:48,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:32:48,823][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:32:48,823][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:32:49,538][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:32:50,153][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:32:50,813][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:32:51,473][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:32:52,134][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:32:52,793][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:32:53,451][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:32:54,110][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:32:54,769][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:32:55,427][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:32:56,087][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:32:56,749][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:32:57,408][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:32:58,067][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:32:58,733][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:32:59,394][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:33:00,054][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:33:00,714][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:33:01,375][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:33:02,034][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:33:02,692][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:33:03,350][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:33:04,009][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:33:04,668][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:33:05,327][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:33:05,986][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:33:06,650][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:33:07,309][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:33:07,968][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:33:08,626][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:33:09,286][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:33:09,946][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:33:10,605][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:33:11,263][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:33:11,922][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:33:12,580][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:33:13,239][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:33:13,897][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:33:14,556][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:33:15,214][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:33:15,872][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:33:16,531][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:33:17,190][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:33:17,848][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:33:18,507][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:33:19,166][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:33:19,824][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:33:20,483][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:33:21,472][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:33:22,130][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:33:22,788][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:33:23,446][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:33:24,104][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:33:24,764][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:33:25,422][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:33:26,080][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:33:26,738][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:33:27,397][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:33:28,054][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:33:28,713][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:33:29,373][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:33:30,031][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:33:30,689][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:33:31,347][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:33:32,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:33:32,790][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:33:36,281][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:33:36,284][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:33:36,285][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:33:37,674][__main__][INFO] - Iteration 354 took 54s (9.87% Gen, 87.59% Train). Generation: 5s, Training: 48s. Estimated remaining time: 9h 47m 6s. Estimated total time: 15h 14m 39s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 19s. [2026-03-25 19:33:37,677][__main__][INFO] - Starting iteration 354. [2026-03-25 19:33:37,681][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:33:37,682][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:33:43,271][__main__][INFO] - Number of regex retries in iteration 354: 0 [2026-03-25 19:33:43,273][__main__][INFO] - agents played in iteration 354 are Bob, Alice [2026-03-25 19:33:43,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:33:43,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:33:43,913][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:33:43,913][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:33:44,740][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:33:45,348][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:33:46,007][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:33:46,666][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:33:47,325][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:33:47,983][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:33:48,642][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:33:49,300][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:33:49,959][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:33:50,618][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:33:51,278][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:33:51,935][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:33:52,594][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:33:53,254][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:33:53,913][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:33:54,571][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:33:55,230][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:33:55,889][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:33:56,548][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:33:57,206][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:33:57,864][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:33:58,524][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:33:59,184][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:33:59,842][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:34:00,501][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:34:01,160][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:34:01,820][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:34:02,479][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:34:03,138][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:34:03,796][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:34:04,454][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:34:05,113][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:34:05,771][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:34:06,430][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:34:07,088][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:34:07,749][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:34:08,407][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:34:09,066][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:34:09,725][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:34:10,385][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:34:11,044][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:34:11,704][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:34:12,364][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:34:13,023][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:34:13,682][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:34:14,341][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:34:15,000][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:34:15,658][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:34:16,642][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:34:17,303][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:34:17,962][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:34:18,621][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:34:19,278][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:34:19,936][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:34:20,597][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:34:21,254][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:34:21,912][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:34:22,570][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:34:23,229][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:34:23,890][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:34:24,552][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:34:25,207][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:34:25,867][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:34:26,526][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:34:27,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:34:27,973][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:34:29,344][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:34:29,346][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:34:29,347][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:34:30,850][__main__][INFO] - Iteration 355 took 53s (10.52% Gen, 86.65% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 17m 45s. Estimated total time: 14h 46m 10s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 37s, 500 more iterations: 7h 23m 5s. [2026-03-25 19:34:30,852][__main__][INFO] - Starting iteration 355. [2026-03-25 19:34:30,857][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:34:30,857][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:34:36,013][__main__][INFO] - Number of regex retries in iteration 355: 0 [2026-03-25 19:34:36,014][__main__][INFO] - agents played in iteration 355 are Bob, Alice [2026-03-25 19:34:36,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:34:36,593][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:34:36,594][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:34:36,594][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:34:37,407][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:34:38,023][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:34:38,679][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:34:39,338][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:34:39,999][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:34:40,657][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:34:41,317][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:34:41,976][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:34:42,636][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:34:43,294][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:34:43,954][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:34:44,613][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:34:45,273][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:34:45,932][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:34:46,591][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:34:47,251][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:34:47,910][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:34:48,571][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:34:49,230][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:34:49,889][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:34:50,547][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:34:51,211][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:34:51,869][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:34:52,529][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:34:53,188][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:34:53,847][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:34:54,506][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:34:55,165][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:34:55,826][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:34:56,485][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:34:57,143][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:34:57,802][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:34:58,462][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:34:59,122][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:34:59,782][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:35:00,441][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:35:01,102][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:35:01,762][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:35:02,422][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:35:03,084][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:35:03,742][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:35:04,401][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:35:05,060][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:35:05,718][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:35:06,377][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:35:07,036][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:35:07,695][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:35:08,353][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:35:09,348][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:35:10,009][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:35:10,665][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:35:11,323][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:35:11,981][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:35:12,639][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:35:13,296][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:35:13,957][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:35:14,617][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:35:15,275][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:35:15,933][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:35:16,594][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:35:17,252][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:35:17,909][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:35:18,567][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:35:19,226][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:35:19,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:35:20,767][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:35:22,187][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:35:22,190][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:35:22,191][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:35:23,712][__main__][INFO] - Iteration 356 took 52s (9.76% Gen, 87.36% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 11m 39s. Estimated total time: 14h 40m 57s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 5s, 500 more iterations: 7h 20m 28s. [2026-03-25 19:35:25,218][__main__][INFO] - Starting iteration 356. [2026-03-25 19:35:25,223][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:35:25,223][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:35:31,001][__main__][INFO] - Number of regex retries in iteration 356: 0 [2026-03-25 19:35:31,003][__main__][INFO] - agents played in iteration 356 are Bob, Alice [2026-03-25 19:35:31,849][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:35:31,911][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:35:31,911][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:35:31,912][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:35:32,587][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:35:33,212][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:35:33,873][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:35:34,533][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:35:35,195][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:35:35,854][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:35:36,516][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:35:37,175][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:35:37,836][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:35:38,496][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:35:39,156][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:35:39,816][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:35:40,476][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:35:41,135][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:35:41,796][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:35:42,455][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:35:43,115][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:35:43,774][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:35:44,433][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:35:45,091][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:35:45,750][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:35:46,409][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:35:47,067][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:35:47,727][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:35:48,386][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:35:49,044][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:35:49,702][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:35:50,360][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:35:51,020][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:35:51,678][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:35:52,337][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:35:52,995][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:35:53,653][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:35:54,316][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:35:54,977][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:35:55,636][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:35:56,296][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:35:56,956][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:35:57,616][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:35:58,276][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:35:58,936][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:35:59,596][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:36:00,256][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:36:00,917][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:36:01,577][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:36:02,237][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:36:02,896][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:36:03,556][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:36:04,561][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:36:05,219][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:36:05,878][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:36:06,535][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:36:07,192][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:36:07,852][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:36:08,512][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:36:09,170][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:36:09,828][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:36:10,488][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:36:11,147][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:36:11,806][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:36:12,464][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:36:13,123][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:36:13,781][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:36:14,439][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:36:15,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:36:16,556][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:36:17,973][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:36:17,976][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:36:17,978][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:36:19,402][__main__][INFO] - Iteration 357 took 54s (10.67% Gen, 86.70% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 32m 47s. Estimated total time: 15h 3m 1s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 18s, 500 more iterations: 7h 31m 30s. [2026-03-25 19:36:19,404][__main__][INFO] - Starting iteration 357. [2026-03-25 19:36:19,408][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:36:19,409][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:36:27,690][__main__][INFO] - Number of regex retries in iteration 357: 0 [2026-03-25 19:36:27,691][__main__][INFO] - agents played in iteration 357 are Bob, Alice [2026-03-25 19:36:28,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:36:28,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:36:28,343][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:36:28,343][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:36:29,052][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:36:29,663][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:36:30,323][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:36:30,982][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:36:31,642][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:36:32,301][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:36:32,958][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:36:33,618][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:36:34,278][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:36:34,938][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:36:35,598][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:36:36,257][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:36:36,916][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:36:37,575][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:36:38,236][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:36:38,895][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:36:39,557][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:36:40,217][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:36:40,876][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:36:41,535][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:36:42,194][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:36:42,856][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:36:43,515][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:36:44,176][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:36:44,834][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:36:45,493][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:36:46,153][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:36:46,811][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:36:47,470][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:36:48,129][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:36:48,788][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:36:49,447][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:36:50,105][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:36:50,763][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:36:51,422][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:36:52,081][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:36:52,740][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:36:53,398][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:36:54,057][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:36:54,716][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:36:55,374][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:36:56,033][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:36:56,692][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:36:57,351][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:36:58,011][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:36:58,671][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:36:59,333][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:36:59,991][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:37:00,968][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:37:01,627][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:37:02,286][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:37:02,945][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:37:03,604][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:37:04,262][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:37:04,922][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:37:05,579][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:37:06,239][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:37:06,899][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:37:07,557][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:37:08,214][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:37:08,872][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:37:09,532][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:37:10,190][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:37:10,850][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:37:11,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:37:12,243][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:37:13,666][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:37:13,669][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:37:13,671][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:37:15,054][__main__][INFO] - Iteration 358 took 55s (14.88% Gen, 82.63% Train). Generation: 8s, Training: 45s. Estimated remaining time: 9h 56m 18s. Estimated total time: 15h 27m 27s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 44s, 500 more iterations: 7h 43m 43s. [2026-03-25 19:37:15,056][__main__][INFO] - Starting iteration 358. [2026-03-25 19:37:15,060][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:37:15,061][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:37:20,637][__main__][INFO] - Number of regex retries in iteration 358: 0 [2026-03-25 19:37:20,638][__main__][INFO] - agents played in iteration 358 are Bob, Alice [2026-03-25 19:37:21,141][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:37:21,201][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:37:21,202][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:37:21,203][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:37:22,057][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:37:22,671][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:37:23,335][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:37:23,994][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:37:24,653][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:37:25,314][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:37:25,972][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:37:26,631][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:37:27,290][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:37:27,949][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:37:28,610][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:37:29,270][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:37:29,929][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:37:30,588][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:37:31,247][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:37:31,906][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:37:32,565][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:37:33,223][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:37:33,883][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:37:34,542][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:37:35,201][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:37:35,859][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:37:36,517][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:37:37,176][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:37:37,835][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:37:38,494][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:37:39,153][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:37:39,813][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:37:40,471][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:37:41,130][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:37:41,789][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:37:42,449][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:37:43,107][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:37:43,766][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:37:44,425][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:37:45,084][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:37:45,743][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:37:46,402][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:37:47,062][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:37:47,721][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:37:48,379][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:37:49,038][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:37:49,700][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:37:50,358][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:37:51,018][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:37:51,679][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:37:52,338][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:37:52,998][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:37:53,977][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:37:54,637][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:37:55,297][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:37:55,956][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:37:56,615][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:37:57,273][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:37:57,932][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:37:58,592][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:37:59,251][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:37:59,910][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:38:00,569][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:38:01,228][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:38:01,885][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:38:02,543][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:38:03,202][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:38:03,861][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:38:04,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:38:05,355][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:38:07,171][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:38:07,174][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:38:07,176][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:38:08,614][__main__][INFO] - Iteration 359 took 53s (10.41% Gen, 86.90% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 20m 31s. Estimated total time: 14h 52m 35s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 15s, 500 more iterations: 7h 26m 17s. [2026-03-25 19:38:08,616][__main__][INFO] - Starting iteration 359. [2026-03-25 19:38:08,620][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:38:08,620][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:38:13,635][__main__][INFO] - Number of regex retries in iteration 359: 0 [2026-03-25 19:38:13,637][__main__][INFO] - agents played in iteration 359 are Bob, Alice [2026-03-25 19:38:14,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:38:14,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:38:14,193][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:38:14,194][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:38:14,968][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:38:15,575][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:38:16,235][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:38:16,898][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:38:17,561][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:38:18,220][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:38:18,879][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:38:19,538][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:38:20,197][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:38:20,856][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:38:21,516][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:38:22,175][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:38:22,834][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:38:23,493][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:38:24,152][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:38:24,811][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:38:25,475][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:38:26,134][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:38:26,793][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:38:27,451][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:38:28,110][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:38:28,770][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:38:29,428][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:38:30,091][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:38:30,750][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:38:31,409][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:38:32,069][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:38:32,728][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:38:33,387][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:38:34,046][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:38:34,705][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:38:35,364][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:38:36,023][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:38:36,682][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:38:37,340][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:38:37,999][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:38:38,659][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:38:39,319][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:38:39,978][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:38:40,637][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:38:41,296][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:38:41,954][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:38:42,613][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:38:43,271][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:38:43,931][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:38:44,590][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:38:45,249][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:38:45,909][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:38:46,892][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:38:47,550][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:38:48,211][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:38:48,870][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:38:49,530][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:38:50,189][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:38:50,849][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:38:51,508][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:38:52,168][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:38:52,826][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:38:53,485][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:38:54,142][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:38:54,801][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:38:55,460][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:38:56,118][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:38:56,776][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:38:57,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:38:58,156][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:38:59,534][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:38:59,536][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:38:59,537][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:39:00,911][__main__][INFO] - Iteration 360 took 52s (9.59% Gen, 87.77% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 58m 37s. Estimated total time: 14h 31m 33s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 9s, 500 more iterations: 7h 15m 46s. [2026-03-25 19:39:00,913][__main__][INFO] - Starting iteration 360. [2026-03-25 19:39:00,917][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:39:00,918][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:39:06,094][__main__][INFO] - Number of regex retries in iteration 360: 0 [2026-03-25 19:39:06,095][__main__][INFO] - agents played in iteration 360 are Bob, Alice [2026-03-25 19:39:06,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:39:06,785][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:39:06,785][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:39:06,786][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:39:07,621][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:39:08,234][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:39:08,894][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:39:09,554][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:39:10,212][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:39:10,871][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:39:11,536][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:39:12,194][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:39:12,854][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:39:13,516][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:39:14,176][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:39:14,835][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:39:15,496][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:39:16,156][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:39:16,815][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:39:17,475][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:39:18,136][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:39:18,795][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:39:19,455][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:39:20,115][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:39:20,775][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:39:21,436][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:39:22,098][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:39:22,756][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:39:23,415][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:39:24,074][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:39:24,737][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:39:25,399][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:39:26,057][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:39:26,717][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:39:27,376][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:39:28,034][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:39:28,693][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:39:29,353][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:39:30,012][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:39:30,671][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:39:31,330][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:39:31,988][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:39:32,647][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:39:33,306][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:39:33,965][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:39:34,623][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:39:35,281][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:39:35,940][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:39:36,598][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:39:37,258][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:39:37,918][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:39:38,578][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:39:39,576][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:39:40,234][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:39:40,897][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:39:41,555][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:39:42,215][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:39:42,873][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:39:43,531][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:39:44,188][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:39:44,845][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:39:45,504][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:39:46,161][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:39:46,818][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:39:47,476][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:39:48,133][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:39:48,790][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:39:49,448][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:39:50,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:39:50,885][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:39:52,301][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:39:52,303][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:39:52,305][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:39:53,675][__main__][INFO] - Iteration 361 took 52s (9.81% Gen, 87.59% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 5m 31s. Estimated total time: 14h 39m 19s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 55s, 500 more iterations: 7h 19m 39s. [2026-03-25 19:39:53,678][__main__][INFO] - Starting iteration 361. [2026-03-25 19:39:53,681][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:39:53,682][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:40:00,276][__main__][INFO] - Number of regex retries in iteration 361: 0 [2026-03-25 19:40:00,277][__main__][INFO] - agents played in iteration 361 are Bob, Alice [2026-03-25 19:40:01,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:40:01,435][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:40:01,436][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:40:01,436][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:40:02,155][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:40:02,771][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:40:03,433][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:40:04,092][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:40:04,753][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:40:05,415][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:40:06,074][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:40:06,733][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:40:07,392][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:40:08,051][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:40:08,710][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:40:09,368][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:40:10,027][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:40:10,686][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:40:11,345][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:40:12,003][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:40:12,662][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:40:13,321][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:40:13,979][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:40:14,637][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:40:15,297][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:40:15,955][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:40:16,613][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:40:17,275][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:40:17,935][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:40:18,595][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:40:19,253][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:40:19,912][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:40:20,574][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:40:21,232][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:40:21,890][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:40:22,548][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:40:23,207][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:40:23,865][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:40:24,522][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:40:25,180][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:40:25,838][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:40:26,497][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:40:27,158][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:40:27,818][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:40:28,481][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:40:29,137][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:40:29,798][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:40:30,459][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:40:31,119][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:40:31,778][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:40:32,436][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:40:33,094][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:40:34,086][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:40:34,744][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:40:35,405][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:40:36,063][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:40:36,721][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:40:37,380][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:40:38,039][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:40:38,697][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:40:39,356][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:40:40,014][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:40:40,673][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:40:41,330][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:40:41,988][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:40:42,646][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:40:43,304][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:40:43,962][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:40:44,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:40:45,446][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:40:46,840][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:40:46,842][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:40:46,844][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:40:48,413][__main__][INFO] - Iteration 362 took 54s (12.05% Gen, 85.08% Train). Generation: 6s, Training: 46s. Estimated remaining time: 9h 37m 29s. Estimated total time: 15h 12m 12s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 13s, 500 more iterations: 7h 36m 6s. [2026-03-25 19:40:48,415][__main__][INFO] - Starting iteration 362. [2026-03-25 19:40:48,420][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:40:48,420][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:41:00,181][__main__][INFO] - Number of regex retries in iteration 362: 0 [2026-03-25 19:41:00,182][__main__][INFO] - agents played in iteration 362 are Bob, Alice [2026-03-25 19:41:00,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:41:00,789][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:41:00,789][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:41:00,790][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:41:01,568][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:41:02,191][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:41:02,850][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:41:03,504][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:41:04,164][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:41:04,823][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:41:05,482][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:41:06,140][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:41:06,798][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:41:07,455][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:41:08,112][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:41:08,771][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:41:09,431][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:41:10,089][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:41:10,748][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:41:11,406][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:41:12,065][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:41:12,723][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:41:13,380][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:41:14,038][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:41:14,695][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:41:15,352][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:41:16,010][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:41:16,667][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:41:17,324][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:41:17,982][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:41:18,644][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:41:19,302][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:41:19,961][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:41:20,620][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:41:21,277][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:41:21,934][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:41:22,591][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:41:23,248][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:41:23,905][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:41:24,563][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:41:25,220][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:41:25,877][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:41:26,535][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:41:27,192][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:41:27,850][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:41:28,508][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:41:29,166][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:41:29,824][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:41:30,486][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:41:31,144][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:41:31,803][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:41:32,463][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:41:33,451][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:41:34,112][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:41:34,770][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:41:35,430][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:41:36,089][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:41:36,748][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:41:37,406][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:41:38,065][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:41:38,724][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:41:39,383][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:41:40,042][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:41:40,702][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:41:41,359][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:41:42,018][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:41:42,677][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:41:43,336][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:41:43,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:41:44,917][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:41:46,268][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:41:46,271][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:41:46,272][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:41:47,805][__main__][INFO] - Iteration 363 took 59s (19.81% Gen, 77.61% Train). Generation: 11s, Training: 46s. Estimated remaining time: 10h 54m 3s. Estimated total time: 16h 29m 46s. Time estimates for 10 more iterations: 9m 53s, 100 more iterations: 1h 38m 58s, 500 more iterations: 8h 14m 53s. [2026-03-25 19:41:47,807][__main__][INFO] - Starting iteration 363. [2026-03-25 19:41:47,811][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:41:47,811][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:41:52,666][__main__][INFO] - Number of regex retries in iteration 363: 0 [2026-03-25 19:41:52,667][__main__][INFO] - agents played in iteration 363 are Bob, Alice [2026-03-25 19:41:53,148][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:41:53,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:41:53,211][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:41:53,211][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:41:54,000][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:41:54,613][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:41:55,273][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:41:55,933][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:41:56,592][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:41:57,252][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:41:57,912][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:41:58,571][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:41:59,232][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:41:59,893][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:42:00,552][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:42:01,212][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:42:01,871][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:42:02,531][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:42:03,191][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:42:03,850][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:42:04,510][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:42:05,168][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:42:05,827][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:42:06,485][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:42:07,144][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:42:07,803][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:42:08,461][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:42:09,119][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:42:09,781][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:42:10,441][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:42:11,101][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:42:11,761][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:42:12,421][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:42:13,082][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:42:13,741][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:42:14,400][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:42:15,060][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:42:15,719][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:42:16,378][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:42:17,039][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:42:17,699][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:42:18,358][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:42:19,017][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:42:19,675][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:42:20,334][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:42:20,993][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:42:21,658][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:42:22,318][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:42:22,974][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:42:23,632][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:42:24,291][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:42:24,955][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:42:25,949][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:42:26,612][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:42:27,273][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:42:27,930][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:42:28,589][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:42:29,250][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:42:29,909][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:42:30,568][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:42:31,227][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:42:31,885][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:42:32,544][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:42:33,202][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:42:33,861][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:42:34,520][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:42:35,177][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:42:35,835][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:42:36,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:42:37,282][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:42:38,677][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:42:38,679][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:42:38,681][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:42:39,962][__main__][INFO] - Iteration 364 took 52s (9.31% Gen, 88.23% Train). Generation: 4s, Training: 46s. Estimated remaining time: 8h 52m 38s. Estimated total time: 14h 29m 12s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 55s, 500 more iterations: 7h 14m 36s. [2026-03-25 19:42:39,964][__main__][INFO] - Starting iteration 364. [2026-03-25 19:42:39,968][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:42:39,969][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:42:45,527][__main__][INFO] - Number of regex retries in iteration 364: 0 [2026-03-25 19:42:45,528][__main__][INFO] - agents played in iteration 364 are Bob, Alice [2026-03-25 19:42:46,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:42:46,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:42:46,206][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:42:46,206][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:42:46,880][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:42:47,514][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:42:48,172][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:42:48,830][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:42:49,488][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:42:50,146][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:42:50,805][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:42:51,463][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:42:52,121][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:42:52,780][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:42:53,439][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:42:54,097][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:42:54,756][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:42:55,415][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:42:56,081][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:42:56,742][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:42:57,400][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:42:58,060][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:42:58,719][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:42:59,379][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:43:00,039][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:43:00,697][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:43:01,356][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:43:02,014][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:43:02,671][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:43:03,332][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:43:03,989][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:43:04,646][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:43:05,305][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:43:05,962][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:43:06,622][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:43:07,280][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:43:07,939][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:43:08,598][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:43:09,256][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:43:09,915][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:43:10,573][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:43:11,234][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:43:11,893][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:43:12,553][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:43:13,213][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:43:13,873][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:43:14,533][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:43:15,192][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:43:15,850][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:43:16,510][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:43:17,170][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:43:17,829][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:43:18,809][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:43:19,468][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:43:20,127][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:43:20,787][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:43:21,445][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:43:22,104][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:43:22,763][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:43:23,422][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:43:24,081][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:43:24,740][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:43:25,401][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:43:26,062][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:43:26,721][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:43:27,377][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:43:28,036][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:43:28,696][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:43:29,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:43:30,198][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:43:31,430][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:43:31,433][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:43:31,434][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:43:32,733][__main__][INFO] - Iteration 365 took 52s (10.53% Gen, 87.00% Train). Generation: 5s, Training: 45s. Estimated remaining time: 9h 1m 59s. Estimated total time: 14h 39m 27s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 56s, 500 more iterations: 7h 19m 43s. [2026-03-25 19:43:32,736][__main__][INFO] - Starting iteration 365. [2026-03-25 19:43:32,739][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:43:32,739][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:43:38,112][__main__][INFO] - Number of regex retries in iteration 365: 0 [2026-03-25 19:43:38,113][__main__][INFO] - agents played in iteration 365 are Bob, Alice [2026-03-25 19:43:38,950][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:43:39,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:43:39,012][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:43:39,013][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:43:39,663][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:43:40,288][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:43:40,950][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:43:41,610][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:43:42,271][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:43:42,930][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:43:43,591][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:43:44,250][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:43:44,909][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:43:45,569][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:43:46,229][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:43:46,888][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:43:47,549][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:43:48,208][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:43:48,868][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:43:49,526][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:43:50,184][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:43:50,844][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:43:51,503][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:43:52,163][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:43:52,822][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:43:53,480][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:43:54,139][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:43:54,798][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:43:55,458][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:43:56,118][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:43:56,777][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:43:57,437][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:43:58,096][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:43:58,756][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:43:59,415][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:44:00,077][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:44:00,735][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:44:01,396][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:44:02,055][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:44:02,716][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:44:03,376][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:44:04,040][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:44:04,695][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:44:05,358][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:44:06,019][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:44:06,679][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:44:07,338][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:44:07,997][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:44:08,655][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:44:09,313][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:44:09,973][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:44:10,632][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:44:11,643][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:44:12,303][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:44:12,962][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:44:13,620][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:44:14,278][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:44:14,936][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:44:15,593][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:44:16,251][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:44:16,910][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:44:17,568][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:44:18,226][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:44:18,884][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:44:19,542][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:44:20,201][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:44:20,859][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:44:21,525][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:44:22,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:44:22,992][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:44:24,719][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:44:24,722][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:44:24,723][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:44:26,182][__main__][INFO] - Iteration 366 took 53s (10.05% Gen, 87.21% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 12m 24s. Estimated total time: 14h 50m 45s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 4s, 500 more iterations: 7h 25m 22s. [2026-03-25 19:44:26,185][__main__][INFO] - Starting iteration 366. [2026-03-25 19:44:26,189][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:44:26,190][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:44:32,365][__main__][INFO] - Number of regex retries in iteration 366: 0 [2026-03-25 19:44:32,367][__main__][INFO] - agents played in iteration 366 are Bob, Alice [2026-03-25 19:44:33,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:44:33,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:44:33,539][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:44:33,539][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:44:34,208][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:44:34,822][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:44:35,483][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:44:36,141][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:44:36,800][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:44:37,457][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:44:38,116][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:44:38,774][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:44:39,432][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:44:40,089][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:44:40,750][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:44:41,410][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:44:42,069][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:44:42,728][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:44:43,388][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:44:44,046][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:44:44,706][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:44:45,366][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:44:46,027][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:44:46,687][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:44:47,346][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:44:48,004][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:44:48,663][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:44:49,320][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:44:49,979][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:44:50,640][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:44:51,299][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:44:51,958][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:44:52,618][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:44:53,278][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:44:53,935][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:44:54,593][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:44:55,252][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:44:55,912][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:44:56,570][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:44:57,229][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:44:57,888][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:44:58,547][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:44:59,206][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:44:59,865][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:45:00,525][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:45:01,183][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:45:01,842][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:45:02,501][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:45:03,160][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:45:03,821][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:45:04,479][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:45:05,138][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:45:06,114][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:45:06,772][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:45:07,430][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:45:08,088][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:45:08,746][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:45:09,411][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:45:10,069][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:45:10,728][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:45:11,388][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:45:12,046][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:45:12,705][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:45:13,363][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:45:14,021][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:45:14,680][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:45:15,338][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:45:15,997][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:45:16,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:45:17,420][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:45:18,785][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:45:18,788][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:45:18,790][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:45:20,191][__main__][INFO] - Iteration 367 took 54s (11.44% Gen, 85.96% Train). Generation: 6s, Training: 46s. Estimated remaining time: 9h 20m 48s. Estimated total time: 15h 0m 3s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 0s, 500 more iterations: 7h 30m 1s. [2026-03-25 19:45:20,193][__main__][INFO] - Starting iteration 367. [2026-03-25 19:45:20,197][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:45:20,198][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:45:25,746][__main__][INFO] - Number of regex retries in iteration 367: 0 [2026-03-25 19:45:25,747][__main__][INFO] - agents played in iteration 367 are Bob, Alice [2026-03-25 19:45:26,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:45:26,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:45:26,297][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:45:26,297][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:45:27,232][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:45:27,845][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:45:28,505][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:45:29,165][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:45:29,823][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:45:30,482][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:45:31,139][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:45:31,797][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:45:32,455][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:45:33,115][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:45:33,773][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:45:34,430][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:45:35,087][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:45:35,745][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:45:36,403][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:45:37,061][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:45:37,718][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:45:38,377][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:45:39,035][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:45:39,694][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:45:40,352][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:45:41,010][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:45:41,668][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:45:42,326][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:45:42,984][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:45:43,642][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:45:44,301][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:45:44,959][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:45:45,617][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:45:46,275][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:45:46,933][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:45:47,590][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:45:48,248][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:45:48,906][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:45:49,563][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:45:50,221][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:45:50,879][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:45:51,538][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:45:52,195][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:45:52,855][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:45:53,513][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:45:54,170][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:45:54,828][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:45:55,488][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:45:56,146][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:45:56,803][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:45:57,461][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:45:58,119][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:45:59,125][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:45:59,784][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:46:00,442][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:46:01,101][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:46:01,760][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:46:02,419][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:46:03,078][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:46:03,737][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:46:04,395][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:46:05,052][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:46:05,711][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:46:06,370][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:46:07,029][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:46:07,687][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:46:08,345][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:46:09,002][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:46:09,661][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:46:10,487][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:46:11,871][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:46:11,874][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:46:11,875][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:46:13,202][__main__][INFO] - Iteration 368 took 53s (10.47% Gen, 87.02% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 3m 19s. Estimated total time: 14h 43m 26s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 20s, 500 more iterations: 7h 21m 43s. [2026-03-25 19:46:13,204][__main__][INFO] - Starting iteration 368. [2026-03-25 19:46:13,208][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:46:13,209][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:46:18,978][__main__][INFO] - Number of regex retries in iteration 368: 0 [2026-03-25 19:46:18,979][__main__][INFO] - agents played in iteration 368 are Bob, Alice [2026-03-25 19:46:19,508][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:46:19,571][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:46:19,572][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:46:19,572][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:46:20,415][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:46:21,040][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:46:21,699][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:46:22,357][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:46:23,015][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:46:23,673][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:46:24,330][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:46:24,989][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:46:25,647][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:46:26,306][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:46:26,965][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:46:27,623][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:46:28,282][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:46:28,940][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:46:29,598][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:46:30,261][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:46:30,922][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:46:31,580][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:46:32,238][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:46:32,896][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:46:33,554][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:46:34,212][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:46:34,871][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:46:35,530][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:46:36,188][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:46:36,845][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:46:37,504][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:46:38,164][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:46:38,821][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:46:39,480][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:46:40,138][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:46:40,796][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:46:41,455][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:46:42,113][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:46:42,770][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:46:43,429][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:46:44,087][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:46:44,746][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:46:45,404][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:46:46,063][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:46:46,720][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:46:47,378][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:46:48,036][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:46:48,695][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:46:49,353][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:46:50,010][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:46:50,668][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:46:51,326][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:46:52,315][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:46:52,976][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:46:53,637][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:46:54,296][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:46:54,956][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:46:55,614][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:46:56,273][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:46:56,933][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:46:57,591][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:46:58,251][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:46:58,910][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:46:59,570][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:47:00,228][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:47:00,887][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:47:01,546][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:47:02,205][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:47:02,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:47:03,667][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:47:05,096][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:47:05,099][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:47:05,109][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:47:06,810][__main__][INFO] - Iteration 369 took 53s (10.76% Gen, 86.06% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 12m 22s. Estimated total time: 14h 53m 23s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 20s, 500 more iterations: 7h 26m 41s. [2026-03-25 19:47:06,813][__main__][INFO] - Starting iteration 369. [2026-03-25 19:47:06,817][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:47:06,817][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:47:11,948][__main__][INFO] - Number of regex retries in iteration 369: 0 [2026-03-25 19:47:11,949][__main__][INFO] - agents played in iteration 369 are Bob, Alice [2026-03-25 19:47:12,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:47:12,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:47:12,595][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:47:12,596][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:47:13,284][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:47:13,902][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:47:14,562][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:47:15,221][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:47:15,881][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:47:16,540][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:47:17,199][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:47:17,859][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:47:18,517][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:47:19,176][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:47:19,836][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:47:20,495][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:47:21,154][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:47:21,812][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:47:22,472][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:47:23,131][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:47:23,789][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:47:24,447][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:47:25,106][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:47:25,765][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:47:26,423][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:47:27,082][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:47:27,741][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:47:28,400][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:47:29,059][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:47:29,718][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:47:30,376][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:47:31,035][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:47:31,692][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:47:32,351][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:47:33,010][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:47:33,669][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:47:34,328][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:47:34,988][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:47:35,647][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:47:36,305][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:47:36,963][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:47:37,622][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:47:38,280][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:47:38,938][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:47:39,596][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:47:40,254][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:47:40,917][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:47:41,574][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:47:42,233][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:47:42,891][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:47:43,551][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:47:44,209][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:47:45,193][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:47:45,852][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:47:46,511][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:47:47,169][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:47:47,829][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:47:48,487][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:47:49,146][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:47:49,806][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:47:50,466][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:47:51,124][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:47:51,785][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:47:52,443][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:47:53,102][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:47:53,760][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:47:54,419][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:47:55,078][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:47:55,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:47:56,518][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:47:57,891][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:47:57,894][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:47:57,895][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:47:59,298][__main__][INFO] - Iteration 370 took 52s (9.77% Gen, 87.55% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 52m 48s. Estimated total time: 14h 34m 42s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 28s, 500 more iterations: 7h 17m 21s. [2026-03-25 19:47:59,300][__main__][INFO] - Starting iteration 370. [2026-03-25 19:47:59,304][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:47:59,304][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:48:04,800][__main__][INFO] - Number of regex retries in iteration 370: 0 [2026-03-25 19:48:04,801][__main__][INFO] - agents played in iteration 370 are Bob, Alice [2026-03-25 19:48:05,722][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:48:05,785][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:48:05,786][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:48:05,787][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:48:06,597][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:48:07,213][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:48:07,873][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:48:08,532][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:48:09,191][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:48:09,850][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:48:10,512][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:48:11,171][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:48:11,831][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:48:12,489][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:48:13,148][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:48:13,807][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:48:14,465][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:48:15,123][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:48:15,781][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:48:16,440][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:48:17,097][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:48:17,755][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:48:18,412][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:48:19,070][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:48:19,728][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:48:20,385][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:48:21,043][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:48:21,701][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:48:22,360][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:48:23,016][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:48:23,676][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:48:24,334][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:48:24,993][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:48:25,651][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:48:26,309][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:48:26,967][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:48:27,625][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:48:28,283][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:48:28,942][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:48:29,602][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:48:30,264][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:48:30,921][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:48:31,579][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:48:32,238][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:48:32,896][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:48:33,554][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:48:34,212][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:48:34,869][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:48:35,527][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:48:36,186][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:48:36,845][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:48:37,503][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:48:38,490][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:48:39,150][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:48:39,809][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:48:40,467][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:48:41,125][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:48:41,783][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:48:42,442][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:48:43,101][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:48:43,758][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:48:44,417][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:48:45,075][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:48:45,733][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:48:46,391][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:48:47,048][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:48:47,706][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:48:48,364][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:48:49,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:48:49,927][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:48:51,323][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:48:51,326][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:48:51,327][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:48:52,720][__main__][INFO] - Iteration 371 took 53s (10.29% Gen, 87.10% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 7m 30s. Estimated total time: 14h 50m 18s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 1s, 500 more iterations: 7h 25m 9s. [2026-03-25 19:48:52,723][__main__][INFO] - Starting iteration 371. [2026-03-25 19:48:52,727][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:48:52,727][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:48:58,421][__main__][INFO] - Number of regex retries in iteration 371: 0 [2026-03-25 19:48:58,422][__main__][INFO] - agents played in iteration 371 are Bob, Alice [2026-03-25 19:48:59,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:48:59,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:48:59,451][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:48:59,451][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:49:00,318][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:49:00,921][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:49:01,582][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:49:02,241][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:49:02,899][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:49:03,558][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:49:04,215][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:49:04,873][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:49:05,531][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:49:06,191][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:49:06,848][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:49:07,505][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:49:08,163][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:49:08,821][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:49:09,480][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:49:10,139][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:49:10,798][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:49:11,456][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:49:12,114][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:49:12,774][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:49:13,433][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:49:14,091][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:49:14,749][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:49:15,407][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:49:16,064][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:49:16,722][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:49:17,380][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:49:18,038][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:49:18,696][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:49:19,354][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:49:20,014][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:49:20,674][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:49:21,332][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:49:21,991][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:49:22,648][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:49:23,306][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:49:23,964][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:49:24,622][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:49:25,279][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:49:25,937][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:49:26,594][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:49:27,251][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:49:27,908][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:49:28,566][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:49:29,224][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:49:29,882][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:49:30,541][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:49:31,199][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:49:32,179][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:49:32,839][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:49:33,496][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:49:34,155][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:49:34,813][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:49:35,470][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:49:36,129][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:49:36,786][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:49:37,445][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:49:38,103][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:49:38,761][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:49:39,420][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:49:40,078][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:49:40,736][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:49:41,394][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:49:42,052][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:49:42,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:49:43,481][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:49:44,844][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:49:44,855][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:49:44,857][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:49:46,182][__main__][INFO] - Iteration 372 took 53s (10.65% Gen, 86.86% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 7m 16s. Estimated total time: 14h 50m 57s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 5s, 500 more iterations: 7h 25m 28s. [2026-03-25 19:49:46,184][__main__][INFO] - Starting iteration 372. [2026-03-25 19:49:46,188][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:49:46,189][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:49:57,936][__main__][INFO] - Number of regex retries in iteration 372: 0 [2026-03-25 19:49:57,954][__main__][INFO] - agents played in iteration 372 are Bob, Alice [2026-03-25 19:49:58,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:49:58,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:49:58,617][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:49:58,618][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:49:59,388][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:50:00,045][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:50:00,703][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:50:01,360][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:50:02,018][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:50:02,677][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:50:03,334][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:50:03,992][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:50:04,651][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:50:05,308][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:50:05,968][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:50:06,625][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:50:07,284][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:50:07,941][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:50:08,601][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:50:09,259][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:50:09,917][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:50:10,575][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:50:11,236][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:50:11,896][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:50:12,555][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:50:13,297][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:50:13,956][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:50:14,615][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:50:15,272][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:50:15,930][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:50:16,588][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:50:17,246][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:50:17,904][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:50:18,563][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:50:19,221][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:50:19,879][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:50:20,538][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:50:21,196][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:50:21,856][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:50:22,514][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:50:23,172][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:50:23,830][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:50:24,487][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:50:25,145][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:50:25,803][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:50:26,461][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:50:27,118][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:50:27,776][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:50:28,435][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:50:29,093][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:50:29,751][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:50:30,409][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:50:31,395][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:50:32,053][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:50:32,711][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:50:33,369][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:50:34,029][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:50:34,687][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:50:35,345][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:50:36,002][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:50:36,659][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:50:37,317][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:50:37,975][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:50:38,633][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:50:39,290][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:50:39,949][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:50:40,606][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:50:41,264][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:50:41,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:50:42,677][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:50:44,073][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:50:44,076][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:50:44,078][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:50:45,588][__main__][INFO] - Iteration 373 took 59s (19.81% Gen, 77.65% Train). Generation: 11s, Training: 46s. Estimated remaining time: 10h 45m 22s. Estimated total time: 16h 30m 2s. Time estimates for 10 more iterations: 9m 54s, 100 more iterations: 1h 39m 0s, 500 more iterations: 8h 15m 1s. [2026-03-25 19:50:45,591][__main__][INFO] - Starting iteration 373. [2026-03-25 19:50:45,595][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:50:45,595][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:50:50,451][__main__][INFO] - Number of regex retries in iteration 373: 0 [2026-03-25 19:50:50,453][__main__][INFO] - agents played in iteration 373 are Bob, Alice [2026-03-25 19:50:50,940][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:50:51,001][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:50:51,002][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:50:51,002][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:50:51,718][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:50:52,349][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:50:53,006][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:50:53,663][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:50:54,321][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:50:54,980][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:50:55,639][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:50:56,297][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:50:56,954][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:50:57,614][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:50:58,273][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:50:58,932][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:50:59,590][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:51:00,247][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:51:00,907][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:51:01,565][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:51:02,223][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:51:02,881][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:51:03,541][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:51:04,199][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:51:04,858][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:51:05,516][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:51:06,174][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:51:06,833][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:51:07,490][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:51:08,148][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:51:08,807][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:51:09,465][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:51:10,123][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:51:10,781][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:51:11,440][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:51:12,098][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:51:12,755][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:51:13,415][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:51:14,073][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:51:14,730][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:51:15,388][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:51:16,046][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:51:16,704][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:51:17,362][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:51:18,020][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:51:18,678][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:51:19,335][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:51:19,994][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:51:20,653][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:51:21,310][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:51:21,968][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:51:22,626][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:51:23,609][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:51:24,270][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:51:24,929][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:51:25,586][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:51:26,245][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:51:26,904][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:51:27,564][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:51:28,224][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:51:28,886][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:51:29,543][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:51:30,202][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:51:30,862][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:51:31,520][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:51:32,179][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:51:32,837][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:51:33,494][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:51:34,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:51:35,079][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:51:36,499][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:51:36,502][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:51:36,503][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:51:37,968][__main__][INFO] - Iteration 374 took 52s (9.28% Gen, 87.92% Train). Generation: 4s, Training: 46s. Estimated remaining time: 8h 47m 22s. Estimated total time: 14h 32m 55s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 17s, 500 more iterations: 7h 16m 27s. [2026-03-25 19:51:37,970][__main__][INFO] - Starting iteration 374. [2026-03-25 19:51:37,973][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:51:37,974][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:51:44,975][__main__][INFO] - Number of regex retries in iteration 374: 0 [2026-03-25 19:51:44,977][__main__][INFO] - agents played in iteration 374 are Bob, Alice [2026-03-25 19:51:45,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:51:45,920][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:51:45,921][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:51:45,921][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:51:46,634][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:51:47,248][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:51:47,908][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:51:48,568][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:51:49,222][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:51:49,881][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:51:50,540][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:51:51,200][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:51:51,858][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:51:52,517][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:51:53,175][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:51:53,834][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:51:54,492][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:51:55,151][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:51:55,809][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:51:56,466][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:51:57,124][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:51:57,783][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:51:58,441][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:51:59,099][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:51:59,756][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:52:00,415][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:52:01,073][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:52:01,730][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:52:02,387][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:52:03,045][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:52:03,704][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:52:04,362][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:52:05,020][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:52:05,679][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:52:06,338][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:52:06,996][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:52:07,654][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:52:08,312][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:52:08,969][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:52:09,627][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:52:10,285][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:52:10,942][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:52:11,600][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:52:12,259][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:52:12,917][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:52:13,575][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:52:14,233][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:52:14,891][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:52:15,549][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:52:16,206][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:52:16,864][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:52:17,522][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:52:18,509][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:52:19,169][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:52:19,828][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:52:20,486][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:52:21,143][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:52:21,802][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:52:22,460][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:52:23,118][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:52:23,776][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:52:24,437][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:52:25,096][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:52:25,755][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:52:26,413][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:52:27,071][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:52:27,731][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:52:28,388][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:52:29,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:52:29,814][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:52:31,192][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:52:31,194][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:52:31,196][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:52:32,641][__main__][INFO] - Iteration 375 took 54s (12.81% Gen, 84.54% Train). Generation: 7s, Training: 46s. Estimated remaining time: 9h 24m 42s. Estimated total time: 15h 11m 9s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 6s, 500 more iterations: 7h 35m 34s. [2026-03-25 19:52:32,643][__main__][INFO] - Starting iteration 375. [2026-03-25 19:52:32,647][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:52:32,648][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:52:38,341][__main__][INFO] - Number of regex retries in iteration 375: 0 [2026-03-25 19:52:38,342][__main__][INFO] - agents played in iteration 375 are Bob, Alice [2026-03-25 19:52:39,446][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:52:39,507][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:52:39,508][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:52:39,508][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:52:40,336][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:52:40,949][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:52:41,609][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:52:42,267][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:52:42,925][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:52:43,584][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:52:44,244][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:52:44,902][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:52:45,559][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:52:46,221][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:52:46,881][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:52:47,539][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:52:48,198][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:52:48,856][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:52:49,514][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:52:50,173][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:52:50,830][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:52:51,490][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:52:52,148][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:52:52,806][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:52:53,464][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:52:54,123][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:52:54,782][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:52:55,439][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:52:56,099][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:52:56,758][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:52:57,416][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:52:58,074][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:52:58,733][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:52:59,391][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:53:00,049][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:53:00,706][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:53:01,369][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:53:02,029][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:53:02,688][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:53:03,349][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:53:04,010][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:53:04,671][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:53:05,332][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:53:05,992][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:53:06,656][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:53:07,315][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:53:07,975][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:53:08,633][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:53:09,291][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:53:09,949][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:53:10,607][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:53:11,263][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:53:12,253][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:53:12,912][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:53:13,570][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:53:14,228][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:53:14,888][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:53:15,546][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:53:16,204][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:53:16,862][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:53:17,520][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:53:18,180][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:53:18,839][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:53:19,498][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:53:20,160][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:53:20,819][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:53:21,477][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:53:22,135][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:53:22,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:53:23,624][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:53:25,020][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:53:25,023][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:53:25,024][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:53:26,494][__main__][INFO] - Iteration 376 took 53s (10.57% Gen, 86.69% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 10m 7s. Estimated total time: 14h 57m 28s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 44s, 500 more iterations: 7h 28m 44s. [2026-03-25 19:53:26,496][__main__][INFO] - Starting iteration 376. [2026-03-25 19:53:26,500][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:53:26,500][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:53:38,775][__main__][INFO] - Number of regex retries in iteration 376: 0 [2026-03-25 19:53:38,775][__main__][INFO] - agents played in iteration 376 are Bob, Alice [2026-03-25 19:53:39,280][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:53:39,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:53:39,344][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:53:39,344][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:53:40,008][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:53:40,625][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:53:41,285][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:53:41,946][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:53:42,604][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:53:43,265][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:53:43,924][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:53:44,584][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:53:45,244][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:53:45,905][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:53:46,563][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:53:47,224][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:53:47,882][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:53:48,541][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:53:49,201][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:53:49,860][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:53:50,520][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:53:51,182][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:53:51,842][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:53:52,501][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:53:53,160][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:53:53,820][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:53:54,479][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:53:55,138][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:53:55,797][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:53:56,455][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:53:57,114][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:53:57,774][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:53:58,434][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:53:59,093][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:53:59,752][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:54:00,412][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:54:01,072][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:54:01,731][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:54:02,390][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:54:03,049][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:54:03,708][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:54:04,368][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:54:05,028][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:54:05,687][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:54:06,346][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:54:07,005][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:54:07,664][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:54:08,323][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:54:08,982][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:54:09,641][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:54:10,300][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:54:10,960][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:54:11,938][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:54:12,601][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:54:13,259][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:54:13,920][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:54:14,579][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:54:15,239][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:54:15,898][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:54:16,556][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:54:17,216][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:54:17,874][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:54:18,532][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:54:19,190][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:54:19,848][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:54:20,506][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:54:21,164][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:54:21,822][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:54:22,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:54:23,270][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:54:24,735][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:54:24,738][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:54:24,739][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:54:26,151][__main__][INFO] - Iteration 377 took 59s (20.58% Gen, 77.05% Train). Generation: 12s, Training: 45s. Estimated remaining time: 10h 45m 52s. Estimated total time: 16h 34m 13s. Time estimates for 10 more iterations: 9m 56s, 100 more iterations: 1h 39m 25s, 500 more iterations: 8h 17m 6s. [2026-03-25 19:54:26,154][__main__][INFO] - Starting iteration 377. [2026-03-25 19:54:26,158][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:54:26,158][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:54:34,063][__main__][INFO] - Number of regex retries in iteration 377: 0 [2026-03-25 19:54:34,065][__main__][INFO] - agents played in iteration 377 are Bob, Alice [2026-03-25 19:54:34,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:54:34,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:54:34,645][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:54:34,646][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:54:35,507][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:54:37,040][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:54:37,701][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:54:38,359][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:54:39,017][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:54:39,676][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:54:40,338][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:54:40,997][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:54:41,655][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:54:42,315][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:54:42,973][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:54:43,633][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:54:44,291][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:54:44,949][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:54:45,868][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:54:46,526][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:54:47,185][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:54:47,843][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:54:48,501][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:54:49,159][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:54:49,817][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:54:50,475][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:54:51,134][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:54:51,792][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:54:52,453][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:54:53,112][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:54:53,770][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:54:54,428][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:54:55,085][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:54:55,745][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:54:56,404][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:54:57,062][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:54:57,721][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:54:58,380][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:54:59,038][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:54:59,696][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:55:00,355][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:55:01,015][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:55:01,674][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:55:02,331][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:55:02,989][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:55:03,648][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:55:04,307][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:55:04,965][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:55:05,624][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:55:06,285][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:55:06,944][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:55:07,602][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:55:08,597][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:55:09,256][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:55:09,914][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:55:10,573][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:55:11,232][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:55:11,890][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:55:12,549][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:55:13,207][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:55:13,866][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:55:14,965][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:55:16,276][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:55:16,934][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:55:17,593][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:55:18,253][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:55:18,911][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:55:19,570][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:55:20,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:55:21,132][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:45 [2026-03-25 19:55:22,591][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:55:22,593][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:55:22,595][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:55:24,749][__main__][INFO] - Iteration 378 took 58s (13.49% Gen, 82.82% Train). Generation: 7s, Training: 48s. Estimated remaining time: 10h 27m 14s. Estimated total time: 16h 16m 33s. Time estimates for 10 more iterations: 9m 45s, 100 more iterations: 1h 37m 39s, 500 more iterations: 8h 8m 16s. [2026-03-25 19:55:24,753][__main__][INFO] - Starting iteration 378. [2026-03-25 19:55:24,758][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:55:24,759][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:55:30,450][__main__][INFO] - Number of regex retries in iteration 378: 0 [2026-03-25 19:55:30,451][__main__][INFO] - agents played in iteration 378 are Bob, Alice [2026-03-25 19:55:31,317][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:55:31,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:55:31,379][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:55:31,379][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:55:32,226][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:55:32,839][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:55:33,501][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:55:34,160][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:55:34,819][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:55:35,480][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:55:36,139][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:55:36,799][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:55:37,459][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:55:38,119][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:55:38,778][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:55:39,438][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:55:40,097][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:55:40,756][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:55:41,417][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:55:42,076][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:55:42,735][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:55:43,393][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:55:44,052][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:55:44,712][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:55:45,371][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:55:46,029][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:55:46,689][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:55:47,349][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:55:48,008][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:55:48,666][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:55:49,325][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:55:49,984][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:55:50,643][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:55:51,302][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:55:51,962][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:55:52,621][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:55:53,280][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:55:53,940][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:55:54,600][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:55:55,259][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:55:55,918][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:55:56,577][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:55:57,236][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:55:57,895][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:55:58,555][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:55:59,214][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:55:59,873][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:56:00,532][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:56:01,191][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:56:01,850][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:56:02,509][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:56:03,168][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:56:04,165][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:56:04,824][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:56:05,483][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:56:06,142][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:56:06,800][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:56:07,458][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:56:08,117][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:56:08,775][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:56:09,434][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:56:10,092][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:56:10,750][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:56:11,409][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:56:12,067][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:56:12,726][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:56:13,384][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:56:14,042][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:56:14,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:56:15,473][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:56:16,826][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:56:16,829][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:56:16,830][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:56:18,166][__main__][INFO] - Iteration 379 took 53s (10.66% Gen, 86.84% Train). Generation: 5s, Training: 46s. Estimated remaining time: 8h 59m 57s. Estimated total time: 14h 50m 10s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 1s, 500 more iterations: 7h 25m 5s. [2026-03-25 19:56:18,168][__main__][INFO] - Starting iteration 379. [2026-03-25 19:56:18,173][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:56:18,174][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:56:24,319][__main__][INFO] - Number of regex retries in iteration 379: 0 [2026-03-25 19:56:24,319][__main__][INFO] - agents played in iteration 379 are Bob, Alice [2026-03-25 19:56:25,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:56:25,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:56:25,451][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:56:25,451][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:56:26,120][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:56:26,742][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:56:27,404][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:56:28,065][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:56:28,724][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:56:29,385][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:56:30,046][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:56:30,707][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:56:31,365][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:56:32,027][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:56:32,688][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:56:33,347][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:56:34,006][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:56:34,665][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:56:35,326][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:56:35,985][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:56:36,643][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:56:37,302][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:56:37,961][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:56:38,621][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:56:39,281][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:56:40,501][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:56:41,159][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:56:41,817][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:56:42,475][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:56:43,137][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:56:43,799][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:56:44,455][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:56:45,116][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:56:45,773][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:56:46,432][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:56:47,090][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:56:47,747][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:56:48,406][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:56:49,064][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:56:49,721][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:56:50,380][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:56:51,037][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:56:51,695][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:56:52,354][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:56:53,013][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:56:53,672][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:56:54,330][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:56:54,989][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:56:55,648][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:56:56,306][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:56:56,965][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:56:57,623][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:56:58,610][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:56:59,271][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:56:59,931][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:57:00,589][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:57:01,247][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:57:01,905][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:57:02,566][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:57:03,223][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:57:03,882][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:57:04,542][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:57:05,201][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:57:05,860][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:57:06,518][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:57:07,176][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:57:07,836][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:57:08,499][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:57:09,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:57:09,883][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:57:11,329][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:57:11,332][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:57:11,333][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:57:13,762][__main__][INFO] - Iteration 380 took 55s (11.05% Gen, 84.57% Train). Generation: 6s, Training: 47s. Estimated remaining time: 9h 35m 22s. Estimated total time: 15h 26m 31s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 39s, 500 more iterations: 7h 43m 15s. [2026-03-25 19:57:13,764][__main__][INFO] - Starting iteration 380. [2026-03-25 19:57:13,768][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:57:13,768][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:57:22,210][__main__][INFO] - Number of regex retries in iteration 380: 0 [2026-03-25 19:57:22,212][__main__][INFO] - agents played in iteration 380 are Bob, Alice [2026-03-25 19:57:22,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:57:22,763][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:57:22,764][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:57:22,764][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:57:23,428][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:57:24,046][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:57:24,706][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:57:25,366][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:57:26,026][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:57:26,685][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:57:27,345][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:57:28,003][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:57:28,663][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:57:29,322][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:57:29,981][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:57:30,641][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:57:31,300][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:57:31,959][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:57:32,618][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:57:33,277][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:57:33,936][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:57:34,595][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:57:35,255][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:57:35,915][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:57:36,574][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:57:37,233][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:57:37,891][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:57:38,550][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:57:39,209][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:57:39,868][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:57:40,530][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:57:41,190][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:57:41,851][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:57:42,511][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:57:43,171][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:57:43,830][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:57:44,490][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:57:45,149][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:57:45,807][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:57:46,466][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:57:47,124][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:57:47,783][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:57:48,442][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:57:49,101][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:57:49,761][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:57:50,420][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:57:51,079][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:57:51,738][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:57:52,396][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:57:53,055][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:57:53,714][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:57:54,373][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:57:55,352][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:57:56,012][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:57:56,669][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:57:57,327][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:57:57,985][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:57:58,645][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:57:59,303][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:57:59,962][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:58:00,619][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:58:01,278][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:58:01,935][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:58:02,593][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:58:03,251][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:58:03,908][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:58:04,568][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:58:05,226][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:58:05,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:58:06,611][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:58:07,949][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:58:07,952][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:58:07,953][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:58:09,458][__main__][INFO] - Iteration 381 took 55s (15.16% Gen, 82.13% Train). Generation: 8s, Training: 45s. Estimated remaining time: 9h 36m 8s. Estimated total time: 15h 28m 12s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 49s, 500 more iterations: 7h 44m 6s. [2026-03-25 19:58:09,461][__main__][INFO] - Starting iteration 381. [2026-03-25 19:58:09,466][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:58:09,466][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:58:19,625][__main__][INFO] - Number of regex retries in iteration 381: 0 [2026-03-25 19:58:19,626][__main__][INFO] - agents played in iteration 381 are Bob, Alice [2026-03-25 19:58:20,159][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:58:20,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:58:20,220][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:58:20,221][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:58:20,896][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:58:21,522][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:58:22,183][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:58:22,842][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:58:23,501][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:58:24,159][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:58:24,819][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:58:25,478][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:58:26,138][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:58:26,796][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:58:27,454][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:58:28,113][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:58:28,772][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:58:29,430][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:58:30,089][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:58:30,748][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:58:31,410][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:58:32,069][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:58:32,729][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:58:33,387][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:58:34,045][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:58:34,704][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:58:35,363][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:58:36,023][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:58:36,682][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:58:37,340][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:58:37,999][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:58:38,658][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:58:39,316][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:58:39,975][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:58:40,633][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:58:41,297][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:58:41,957][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:58:42,615][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:58:43,274][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:58:43,933][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:58:44,592][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:58:45,251][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:58:45,910][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:58:46,569][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:58:47,228][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:58:47,887][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:58:48,545][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:58:49,204][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:58:49,863][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:58:50,522][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:58:51,183][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:58:51,841][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:58:52,829][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:58:53,487][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:58:54,146][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:58:54,804][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:58:55,463][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:58:56,122][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:58:56,779][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:58:57,437][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:58:58,095][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:58:58,754][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:58:59,413][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:59:00,070][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:59:00,728][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:59:01,386][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:59:02,045][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:59:02,703][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:59:03,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:59:04,142][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 19:59:05,869][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:59:05,872][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:59:05,873][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:59:07,268][__main__][INFO] - Iteration 382 took 57s (17.58% Gen, 80.00% Train). Generation: 10s, Training: 46s. Estimated remaining time: 10h 10m 23s. Estimated total time: 16h 3m 24s. Time estimates for 10 more iterations: 9m 38s, 100 more iterations: 1h 36m 20s, 500 more iterations: 8h 1m 42s. [2026-03-25 19:59:07,270][__main__][INFO] - Starting iteration 382. [2026-03-25 19:59:07,274][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 19:59:07,274][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:59:13,039][__main__][INFO] - Number of regex retries in iteration 382: 0 [2026-03-25 19:59:13,040][__main__][INFO] - agents played in iteration 382 are Bob, Alice [2026-03-25 19:59:14,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:59:14,169][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 19:59:14,170][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:59:14,171][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:59:15,025][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:59:15,635][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:59:16,298][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:59:16,958][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:59:17,616][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:59:18,276][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:59:18,933][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:59:19,591][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:59:20,250][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:59:20,909][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:59:21,566][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:59:22,224][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:59:22,882][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:59:23,540][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:59:24,197][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:59:24,855][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:59:25,513][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:59:26,171][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:59:26,829][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:59:29,032][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:59:29,690][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:59:30,348][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:59:31,006][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:59:31,664][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:59:32,322][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:59:32,980][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:59:33,637][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:59:34,295][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:59:34,953][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:59:35,612][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:59:36,270][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:59:36,928][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:59:37,586][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:59:38,243][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:59:38,901][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:59:39,558][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:59:40,217][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:59:40,875][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:59:41,533][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:59:42,190][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:59:42,848][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:59:43,507][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:59:44,165][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:59:44,823][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:59:45,481][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:59:46,138][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:59:46,796][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:59:47,453][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:59:48,444][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:59:49,104][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:59:49,762][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:59:50,422][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:59:51,080][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:59:51,738][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:59:52,397][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:59:53,060][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:59:53,717][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:59:54,376][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:59:55,035][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:59:55,694][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:59:56,352][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:59:57,011][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:59:57,670][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:59:58,328][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:59:58,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:59:59,775][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 20:00:01,447][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:00:01,450][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:00:01,451][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:00:02,787][__main__][INFO] - Iteration 383 took 55s (10.39% Gen, 87.20% Train). Generation: 5s, Training: 48s. Estimated remaining time: 9h 31m 17s. Estimated total time: 15h 25m 15s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 31s, 500 more iterations: 7h 42m 37s. [2026-03-25 20:00:02,789][__main__][INFO] - Starting iteration 383. [2026-03-25 20:00:02,794][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:00:02,794][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:00:10,681][__main__][INFO] - Number of regex retries in iteration 383: 0 [2026-03-25 20:00:10,682][__main__][INFO] - agents played in iteration 383 are Bob, Alice [2026-03-25 20:00:11,206][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:00:11,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:00:11,347][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:00:11,348][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:00:12,020][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:00:12,624][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:00:15,798][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:00:16,456][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:00:17,247][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:00:18,025][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:00:18,684][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:00:19,477][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:00:20,136][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:00:20,793][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:00:21,452][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:00:22,109][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:00:22,767][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:00:23,424][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:00:26,094][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:00:26,753][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:00:27,411][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:00:28,070][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:00:28,897][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:00:29,554][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:00:30,211][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:00:30,869][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:00:31,527][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:00:32,184][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:00:32,844][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:00:33,502][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:00:34,159][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:00:34,816][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:00:35,473][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:00:36,130][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:00:37,091][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:00:37,749][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:00:38,407][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:00:39,065][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:00:39,800][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:00:40,457][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:00:41,215][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:00:41,873][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:00:42,532][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:00:43,190][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:00:43,848][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:00:44,506][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:00:46,237][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:00:46,922][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:00:47,580][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:00:49,831][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:00:50,489][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:00:51,147][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:00:52,133][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:00:52,791][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:00:53,449][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:00:54,113][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:00:54,771][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:00:55,429][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:00:56,088][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:00:56,746][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:00:57,403][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:00:58,063][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:00:58,723][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:00:59,381][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:01:00,039][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:01:00,696][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:01:01,359][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:01:02,020][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:01:02,678][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:01:03,551][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:51 [2026-03-25 20:01:04,862][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:01:04,865][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:01:04,866][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:01:06,203][__main__][INFO] - Iteration 384 took 1m 3s (12.44% Gen, 85.45% Train). Generation: 7s, Training: 54s. Estimated remaining time: 11h 41m 49s. Estimated total time: 17h 36m 50s. Time estimates for 10 more iterations: 10m 34s, 100 more iterations: 1h 45m 41s, 500 more iterations: 8h 48m 25s. [2026-03-25 20:01:06,205][__main__][INFO] - Starting iteration 384. [2026-03-25 20:01:06,209][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:01:06,210][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:01:12,928][__main__][INFO] - Number of regex retries in iteration 384: 0 [2026-03-25 20:01:12,930][__main__][INFO] - agents played in iteration 384 are Bob, Alice [2026-03-25 20:01:13,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:01:13,481][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:01:13,481][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:01:13,482][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:01:14,182][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:01:14,790][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:01:15,450][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:01:16,109][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:01:16,770][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:01:17,430][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:01:18,090][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:01:18,749][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:01:19,408][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:01:20,069][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:01:22,309][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:01:22,966][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:01:23,624][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:01:24,282][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:01:24,940][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:01:25,598][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:01:26,257][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:01:26,915][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:01:27,574][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:01:28,232][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:01:28,891][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:01:29,549][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:01:30,207][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:01:30,865][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:01:31,524][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:01:32,182][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:01:32,840][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:01:33,497][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:01:34,155][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:01:34,813][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:01:35,471][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:01:36,131][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:01:36,788][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:01:37,446][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:01:38,103][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:01:38,761][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:01:39,419][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:01:40,077][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:01:40,735][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:01:41,398][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:01:42,055][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:01:42,713][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:01:43,372][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:01:44,030][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:01:44,688][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:01:45,347][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:01:46,006][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:01:46,664][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:01:47,652][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:01:48,309][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:01:48,967][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:01:49,626][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:01:50,284][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:01:50,943][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:01:51,601][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:01:52,262][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:01:52,921][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:01:53,580][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:01:54,238][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:01:54,897][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:01:55,559][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:01:56,218][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:01:56,877][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:01:57,535][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:01:58,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:01:58,961][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 20:02:00,281][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:02:00,284][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:02:00,581][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:02:01,918][__main__][INFO] - Iteration 385 took 55s (12.06% Gen, 85.53% Train). Generation: 6s, Training: 47s. Estimated remaining time: 9h 32m 34s. Estimated total time: 15h 28m 30s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 51s, 500 more iterations: 7h 44m 15s. [2026-03-25 20:02:01,920][__main__][INFO] - Starting iteration 385. [2026-03-25 20:02:01,924][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:02:01,925][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:02:17,664][__main__][INFO] - Number of regex retries in iteration 385: 0 [2026-03-25 20:02:17,665][__main__][INFO] - agents played in iteration 385 are Bob, Alice [2026-03-25 20:02:18,554][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:02:18,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:02:18,617][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:02:18,618][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:02:19,362][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:02:19,977][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:02:20,636][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:02:21,292][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:02:21,950][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:02:22,608][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:02:23,265][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:02:23,924][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:02:24,581][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:02:25,247][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:02:25,904][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:02:26,564][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:02:27,222][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:02:27,879][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:02:28,538][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:02:29,196][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:02:29,856][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:02:30,513][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:02:31,171][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:02:31,828][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:02:32,485][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:02:33,144][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:02:33,802][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:02:34,461][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:02:35,118][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:02:35,776][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:02:36,433][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:02:37,091][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:02:37,750][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:02:38,408][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:02:39,066][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:02:39,723][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:02:40,383][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:02:41,043][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:02:41,701][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:02:42,360][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:02:43,018][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:02:43,676][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:02:44,334][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:02:44,993][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:02:45,652][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:02:46,312][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:02:46,971][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:02:47,629][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:02:48,289][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:02:48,947][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:02:49,606][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:02:50,265][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:02:51,254][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:02:51,914][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:02:52,572][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:02:53,776][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:02:54,434][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:02:55,096][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:02:55,757][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:02:56,415][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:02:57,074][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:02:57,730][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:02:58,390][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:02:59,050][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:02:59,710][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:03:00,371][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:03:01,031][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:03:01,690][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:03:02,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:03:03,171][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:03:04,617][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:03:04,619][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:03:04,621][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:03:06,484][__main__][INFO] - Iteration 386 took 1m 4s (24.38% Gen, 72.73% Train). Generation: 15s, Training: 46s. Estimated remaining time: 11h 59m 0s. Estimated total time: 17h 56m 1s. Time estimates for 10 more iterations: 10m 45s, 100 more iterations: 1h 47m 36s, 500 more iterations: 8h 58m 0s. [2026-03-25 20:03:06,486][__main__][INFO] - Starting iteration 386. [2026-03-25 20:03:06,489][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:03:06,490][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:03:19,547][__main__][INFO] - Number of regex retries in iteration 386: 0 [2026-03-25 20:03:19,548][__main__][INFO] - agents played in iteration 386 are Bob, Alice [2026-03-25 20:03:20,118][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:03:20,180][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:03:20,180][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:03:20,181][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:03:20,840][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:03:21,453][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:03:22,113][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:03:22,774][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:03:23,430][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:03:24,087][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:03:24,745][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:03:25,402][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:03:26,060][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:03:26,718][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:03:28,092][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:03:28,754][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:03:29,415][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:03:30,074][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:03:30,739][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:03:31,398][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:03:32,054][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:03:32,714][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:03:33,377][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:03:34,039][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:03:34,699][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:03:35,357][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:03:36,017][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:03:36,676][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:03:37,334][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:03:37,993][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:03:38,651][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:03:39,311][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:03:39,970][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:03:40,629][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:03:41,288][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:03:41,951][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:03:42,608][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:03:43,267][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:03:43,926][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:03:44,583][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:03:45,242][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:03:45,900][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:03:46,556][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:03:47,214][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:03:47,875][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:03:48,532][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:03:49,189][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:03:49,852][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:03:50,508][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:03:51,166][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:03:51,825][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:03:52,482][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:03:53,472][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:03:54,131][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:03:54,789][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:03:55,447][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:03:56,105][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:03:56,762][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:03:57,421][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:03:58,078][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:03:58,736][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:03:59,396][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:04:00,054][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:04:00,712][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:04:01,369][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:04:02,027][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:04:02,685][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:04:03,343][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:04:04,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:04:04,776][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:04:06,144][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:04:06,147][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:04:06,148][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:04:07,685][__main__][INFO] - Iteration 387 took 1m 1s (21.34% Gen, 76.15% Train). Generation: 13s, Training: 46s. Estimated remaining time: 11h 1m 55s. Estimated total time: 16h 59m 57s. Time estimates for 10 more iterations: 10m 11s, 100 more iterations: 1h 41m 59s, 500 more iterations: 8h 29m 58s. [2026-03-25 20:04:07,687][__main__][INFO] - Starting iteration 387. [2026-03-25 20:04:07,691][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:04:07,692][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:04:12,505][__main__][INFO] - Number of regex retries in iteration 387: 0 [2026-03-25 20:04:12,506][__main__][INFO] - agents played in iteration 387 are Bob, Alice [2026-03-25 20:04:13,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:04:13,090][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:04:13,090][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:04:13,091][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:04:13,762][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:04:14,389][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:04:15,048][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:04:15,707][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:04:16,365][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:04:17,023][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:04:17,682][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:04:18,339][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:04:18,997][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:04:19,656][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:04:20,314][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:04:20,972][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:04:21,630][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:04:22,288][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:04:22,947][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:04:23,605][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:04:24,264][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:04:24,923][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:04:25,581][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:04:26,239][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:04:26,897][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:04:27,555][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:04:28,214][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:04:28,873][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:04:29,530][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:04:30,188][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:04:30,846][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:04:31,504][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:04:32,162][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:04:32,819][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:04:33,478][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:04:34,135][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:04:34,793][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:04:35,452][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:04:36,110][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:04:36,768][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:04:37,427][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:04:38,084][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:04:38,742][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:04:39,401][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:04:40,059][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:04:40,717][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:04:41,376][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:04:42,034][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:04:44,611][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:04:45,271][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:04:45,929][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:04:46,588][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:04:47,586][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:04:48,245][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:04:48,905][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:04:49,564][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:04:50,221][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:04:50,879][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:04:51,537][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:04:52,195][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:04:52,853][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:04:53,511][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:04:54,168][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:04:54,826][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:04:55,484][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:04:56,143][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:04:56,801][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:04:57,459][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:04:58,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:04:58,883][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:45 [2026-03-25 20:05:00,569][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:05:00,572][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:05:00,573][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:05:02,026][__main__][INFO] - Iteration 388 took 54s (8.86% Gen, 88.46% Train). Generation: 4s, Training: 48s. Estimated remaining time: 9h 6m 39s. Estimated total time: 15h 5m 36s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 33s, 500 more iterations: 7h 32m 48s. [2026-03-25 20:05:02,028][__main__][INFO] - Starting iteration 388. [2026-03-25 20:05:02,033][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:05:02,033][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:05:10,509][__main__][INFO] - Number of regex retries in iteration 388: 0 [2026-03-25 20:05:12,252][__main__][INFO] - agents played in iteration 388 are Bob, Alice [2026-03-25 20:05:12,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:05:12,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:05:12,834][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:05:12,835][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:05:13,732][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:05:14,356][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:05:15,017][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:05:15,676][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:05:16,334][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:05:16,994][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:05:17,652][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:05:18,312][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:05:19,097][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:05:19,755][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:05:20,415][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:05:21,073][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:05:21,731][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:05:22,390][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:05:23,048][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:05:23,709][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:05:24,369][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:05:25,029][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:05:25,688][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:05:27,818][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:05:28,476][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:05:29,133][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:05:29,790][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:05:30,447][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:05:31,104][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:05:31,761][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:05:32,420][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:05:33,078][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:05:33,736][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:05:34,393][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:05:35,051][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:05:35,708][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:05:36,366][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:05:37,023][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:05:37,681][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:05:38,339][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:05:38,998][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:05:39,655][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:05:40,313][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:05:40,970][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:05:41,628][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:05:42,285][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:05:42,943][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:05:43,601][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:05:44,258][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:05:44,916][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:05:45,574][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:05:46,231][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:05:47,215][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:05:47,875][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:05:48,534][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:05:49,191][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:05:49,850][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:05:50,510][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:05:51,168][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:05:51,826][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:05:52,484][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:05:53,142][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:05:53,799][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:05:54,457][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:05:55,118][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:05:55,777][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:05:56,436][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:05:57,094][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:05:57,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:05:58,535][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 20:05:59,930][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:05:59,933][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:05:59,963][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:06:01,583][__main__][INFO] - Iteration 389 took 59s (17.16% Gen, 80.12% Train). Generation: 10s, Training: 47s. Estimated remaining time: 10h 32m 35s. Estimated total time: 16h 32m 32s. Time estimates for 10 more iterations: 9m 55s, 100 more iterations: 1h 39m 15s, 500 more iterations: 8h 16m 16s. [2026-03-25 20:06:01,585][__main__][INFO] - Starting iteration 389. [2026-03-25 20:06:01,589][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:06:01,589][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:06:20,391][__main__][INFO] - Number of regex retries in iteration 389: 0 [2026-03-25 20:06:20,393][__main__][INFO] - agents played in iteration 389 are Bob, Alice [2026-03-25 20:06:20,931][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:06:21,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:06:21,001][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:06:21,002][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:06:21,787][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:06:22,400][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:06:23,058][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:06:23,715][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:06:24,372][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:06:25,030][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:06:25,687][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:06:26,345][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:06:27,004][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:06:27,661][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:06:28,319][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:06:28,978][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:06:29,635][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:06:30,295][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:06:30,953][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:06:32,877][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:06:34,113][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:06:34,770][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:06:35,427][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:06:36,085][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:06:36,742][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:06:37,399][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:06:38,057][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:06:38,714][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:06:39,371][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:06:40,029][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:06:40,686][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:06:41,344][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:06:42,002][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:06:42,659][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:06:43,317][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:06:43,976][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:06:44,634][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:06:45,291][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:06:45,949][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:06:46,606][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:06:47,264][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:06:47,922][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:06:48,579][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:06:49,237][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:06:49,895][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:06:50,553][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:06:51,210][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:06:51,873][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:06:52,530][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:06:53,191][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:06:53,848][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:06:54,506][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:06:55,500][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:06:56,161][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:06:56,823][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:06:57,478][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:06:58,137][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:06:58,795][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:06:59,452][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:07:00,110][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:07:00,768][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:07:01,428][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:07:02,087][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:07:02,744][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:07:03,403][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:07:04,061][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:07:04,719][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:07:05,378][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:07:06,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:07:06,809][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:45 [2026-03-25 20:07:08,160][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:07:08,163][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:07:08,164][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:07:09,480][__main__][INFO] - Iteration 390 took 1m 7s (27.70% Gen, 70.36% Train). Generation: 18s, Training: 47s. Estimated remaining time: 12h 50m 28s. Estimated total time: 18h 51m 32s. Time estimates for 10 more iterations: 11m 18s, 100 more iterations: 1h 53m 9s, 500 more iterations: 9h 25m 46s. [2026-03-25 20:07:09,483][__main__][INFO] - Starting iteration 390. [2026-03-25 20:07:09,488][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:07:09,488][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:07:26,149][__main__][INFO] - Number of regex retries in iteration 390: 0 [2026-03-25 20:07:26,150][__main__][INFO] - agents played in iteration 390 are Bob, Alice [2026-03-25 20:07:26,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:07:26,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:07:26,823][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:07:26,823][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:07:27,599][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:07:28,240][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:07:28,899][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:07:29,558][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:07:30,217][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:07:30,876][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:07:31,534][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:07:32,192][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:07:32,850][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:07:33,507][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:07:34,166][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:07:34,823][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:07:35,482][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:07:36,140][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:07:36,798][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:07:37,456][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:07:38,114][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:07:38,771][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:07:39,429][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:07:40,087][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:07:40,745][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:07:41,401][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:07:42,059][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:07:42,717][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:07:43,375][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:07:44,034][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:07:44,691][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:07:45,348][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:07:46,007][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:07:46,665][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:07:47,323][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:07:47,981][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:07:48,637][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:07:49,298][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:07:49,956][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:07:50,615][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:07:51,272][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:07:51,931][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:07:52,590][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:07:53,248][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:07:53,905][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:07:54,563][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:07:55,222][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:07:55,880][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:07:56,538][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:07:57,196][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:07:57,854][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:07:58,512][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:07:59,500][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:08:00,159][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:08:00,818][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:08:01,476][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:08:02,134][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:08:02,792][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:08:03,449][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:08:04,108][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:08:04,767][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:08:05,427][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:08:06,085][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:08:06,743][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:08:07,401][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:08:08,058][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:08:08,716][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:08:09,374][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:08:10,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:08:10,808][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:08:12,220][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:08:12,223][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:08:12,224][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:08:13,566][__main__][INFO] - Iteration 391 took 1m 4s (26.00% Gen, 71.90% Train). Generation: 16s, Training: 46s. Estimated remaining time: 11h 45m 52s. Estimated total time: 17h 48m 0s. Time estimates for 10 more iterations: 10m 40s, 100 more iterations: 1h 46m 48s, 500 more iterations: 8h 54m 0s. [2026-03-25 20:08:13,568][__main__][INFO] - Starting iteration 391. [2026-03-25 20:08:13,574][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:08:13,574][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:08:18,342][__main__][INFO] - Number of regex retries in iteration 391: 0 [2026-03-25 20:08:18,343][__main__][INFO] - agents played in iteration 391 are Bob, Alice [2026-03-25 20:08:18,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:08:18,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:08:18,892][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:08:18,892][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:08:19,644][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:08:20,249][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:08:20,910][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:08:21,568][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:08:22,228][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:08:22,889][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:08:23,546][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:08:24,205][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:08:24,862][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:08:25,521][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:08:26,179][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:08:26,837][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:08:27,495][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:08:28,153][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:08:28,812][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:08:29,469][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:08:30,127][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:08:30,786][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:08:31,448][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:08:32,106][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:08:32,764][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:08:33,422][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:08:34,082][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:08:34,740][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:08:35,398][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:08:36,057][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:08:36,714][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:08:37,372][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:08:38,031][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:08:38,689][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:08:39,348][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:08:40,006][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:08:40,665][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:08:41,324][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:08:41,982][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:08:42,641][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:08:43,300][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:08:43,959][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:08:44,617][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:08:45,275][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:08:45,932][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:08:46,591][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:08:47,249][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:08:47,907][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:08:48,565][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:08:49,224][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:08:49,883][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:08:50,542][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:08:51,526][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:08:53,441][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:08:54,099][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:08:54,758][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:08:55,416][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:08:56,075][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:08:56,733][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:08:57,391][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:08:58,048][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:08:58,706][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:08:59,363][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:09:00,022][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:09:00,680][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:09:01,339][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:09:02,001][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:09:02,662][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:09:03,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:09:04,204][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 20:09:05,752][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:09:05,755][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:09:05,756][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:09:07,240][__main__][INFO] - Iteration 392 took 53s (8.89% Gen, 88.34% Train). Generation: 4s, Training: 47s. Estimated remaining time: 8h 51m 27s. Estimated total time: 14h 54m 28s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 26s, 500 more iterations: 7h 27m 14s. [2026-03-25 20:09:07,243][__main__][INFO] - Starting iteration 392. [2026-03-25 20:09:07,247][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:09:07,248][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:09:15,407][__main__][INFO] - Number of regex retries in iteration 392: 0 [2026-03-25 20:09:15,408][__main__][INFO] - agents played in iteration 392 are Bob, Alice [2026-03-25 20:09:16,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:09:16,115][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:09:16,115][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:09:16,116][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:09:16,799][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:09:17,426][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:09:18,085][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:09:18,744][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:09:19,402][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:09:20,060][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:09:20,720][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:09:21,378][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:09:22,036][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:09:22,694][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:09:23,352][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:09:24,011][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:09:24,669][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:09:25,327][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:09:25,985][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:09:26,642][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:09:27,299][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:09:27,958][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:09:28,617][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:09:29,276][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:09:29,935][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:09:30,593][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:09:31,251][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:09:31,909][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:09:32,566][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:09:33,224][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:09:33,882][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:09:34,539][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:09:35,198][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:09:35,856][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:09:36,513][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:09:37,171][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:09:37,828][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:09:38,485][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:09:39,143][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:09:39,801][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:09:40,458][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:09:41,116][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:09:41,773][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:09:42,431][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:09:43,089][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:09:43,746][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:09:44,406][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:09:45,064][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:09:45,725][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:09:46,382][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:09:47,040][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:09:47,700][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:09:48,694][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:09:49,354][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:09:50,013][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:09:50,672][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:09:51,331][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:09:51,992][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:09:52,651][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:09:53,308][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:09:53,967][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:09:54,624][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:09:55,284][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:09:55,942][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:09:56,601][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:09:57,260][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:09:57,917][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:09:58,577][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:09:59,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:09:59,952][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:10:01,472][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:10:01,475][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:10:01,507][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:10:03,272][__main__][INFO] - Iteration 393 took 56s (14.57% Gen, 82.28% Train). Generation: 8s, Training: 46s. Estimated remaining time: 9h 29m 49s. Estimated total time: 15h 33m 47s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 22s, 500 more iterations: 7h 46m 53s. [2026-03-25 20:10:06,464][__main__][INFO] - Starting iteration 393. [2026-03-25 20:10:06,469][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:10:06,470][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:10:14,039][__main__][INFO] - Number of regex retries in iteration 393: 0 [2026-03-25 20:10:14,041][__main__][INFO] - agents played in iteration 393 are Bob, Alice [2026-03-25 20:10:14,528][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:10:14,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:10:14,591][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:10:14,592][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:10:15,281][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:10:15,893][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:10:16,551][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:10:17,214][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:10:17,872][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:10:18,530][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:10:19,188][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:10:19,845][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:10:20,505][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:10:21,162][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:10:21,819][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:10:22,477][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:10:23,140][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:10:23,796][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:10:24,455][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:10:25,115][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:10:25,772][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:10:26,430][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:10:27,088][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:10:27,746][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:10:28,404][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:10:29,062][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:10:29,720][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:10:30,377][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:10:31,035][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:10:31,694][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:10:32,352][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:10:33,009][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:10:33,667][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:10:34,324][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:10:34,981][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:10:35,639][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:10:36,297][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:10:36,955][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:10:37,614][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:10:38,272][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:10:38,929][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:10:39,588][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:10:40,248][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:10:40,906][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:10:41,565][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:10:42,226][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:10:42,884][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:10:43,543][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:10:44,201][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:10:44,858][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:10:45,517][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:10:46,175][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:10:47,171][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:10:47,830][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:10:48,490][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:10:49,147][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:10:49,808][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:10:50,465][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:10:51,123][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:10:51,781][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:10:52,439][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:10:53,097][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:10:53,754][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:10:54,413][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:10:55,072][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:10:55,730][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:10:56,388][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:10:57,045][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:10:57,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:10:58,613][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:10:59,984][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:10:59,987][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:10:59,988][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:11:01,654][__main__][INFO] - Iteration 394 took 55s (13.72% Gen, 83.26% Train). Generation: 7s, Training: 45s. Estimated remaining time: 9h 14m 50s. Estimated total time: 15h 19m 46s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 58s, 500 more iterations: 7h 39m 53s. [2026-03-25 20:11:01,656][__main__][INFO] - Starting iteration 394. [2026-03-25 20:11:01,659][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:11:01,660][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:11:07,832][__main__][INFO] - Number of regex retries in iteration 394: 0 [2026-03-25 20:11:07,833][__main__][INFO] - agents played in iteration 394 are Bob, Alice [2026-03-25 20:11:08,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:11:08,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:11:08,512][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:11:08,512][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:11:09,373][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:11:10,001][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:11:10,662][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:11:11,324][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:11:11,983][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:11:12,644][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:11:13,303][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:11:13,964][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:11:14,622][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:11:15,281][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:11:15,940][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:11:16,599][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:11:17,258][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:11:17,916][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:11:18,577][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:11:19,235][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:11:19,893][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:11:20,552][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:11:21,212][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:11:21,872][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:11:22,532][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:11:23,193][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:11:23,853][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:11:24,512][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:11:25,170][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:11:25,829][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:11:26,487][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:11:27,146][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:11:27,805][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:11:28,465][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:11:29,124][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:11:29,783][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:11:30,442][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:11:31,101][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:11:31,759][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:11:32,419][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:11:33,077][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:11:33,736][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:11:34,395][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:11:35,054][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:11:35,712][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:11:36,376][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:11:37,034][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:11:37,693][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:11:38,351][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:11:39,010][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:11:39,668][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:11:40,327][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:11:41,324][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:11:41,983][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:11:42,641][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:11:43,300][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:11:43,957][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:11:44,617][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:11:45,276][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:11:45,934][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:11:46,592][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:11:47,251][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:11:47,911][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:11:48,569][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:11:49,227][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:11:49,886][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:11:50,545][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:11:51,202][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:11:51,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:11:52,717][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:11:54,042][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:11:54,045][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:11:54,046][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:11:56,534][__main__][INFO] - Iteration 395 took 54s (11.25% Gen, 84.21% Train). Generation: 6s, Training: 46s. Estimated remaining time: 9h 8m 45s. Estimated total time: 15h 14m 36s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 18s. [2026-03-25 20:11:56,537][__main__][INFO] - Starting iteration 395. [2026-03-25 20:11:56,541][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:11:56,541][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:12:01,622][__main__][INFO] - Number of regex retries in iteration 395: 0 [2026-03-25 20:12:01,622][__main__][INFO] - agents played in iteration 395 are Bob, Alice [2026-03-25 20:12:02,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:12:02,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:12:02,171][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:12:02,171][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:12:02,877][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:12:03,493][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:12:04,153][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:12:04,812][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:12:05,470][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:12:06,128][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:12:06,787][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:12:07,446][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:12:08,106][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:12:08,764][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:12:09,425][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:12:10,084][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:12:10,745][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:12:11,404][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:12:12,063][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:12:12,723][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:12:13,382][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:12:14,042][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:12:14,701][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:12:15,362][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:12:16,021][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:12:16,679][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:12:17,338][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:12:17,997][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:12:18,660][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:12:19,319][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:12:19,980][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:12:20,642][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:12:21,302][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:12:21,960][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:12:22,620][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:12:23,279][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:12:23,938][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:12:25,508][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:12:26,173][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:12:26,834][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:12:27,492][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:12:28,150][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:12:28,809][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:12:29,471][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:12:30,129][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:12:30,787][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:12:31,446][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:12:32,104][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:12:32,762][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:12:33,420][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:12:34,080][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:12:34,738][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:12:35,728][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:12:36,386][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:12:37,045][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:12:37,704][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:12:38,362][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:12:39,021][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:12:39,680][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:12:40,341][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:12:40,999][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:12:41,657][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:12:42,316][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:12:42,974][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:12:43,632][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:12:44,290][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:12:44,948][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:12:45,607][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:12:46,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:12:47,044][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 20:12:48,386][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:12:48,388][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:12:48,390][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:12:49,874][__main__][INFO] - Iteration 396 took 53s (9.53% Gen, 87.68% Train). Generation: 5s, Training: 46s. Estimated remaining time: 8h 42m 10s. Estimated total time: 14h 48m 55s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 53s, 500 more iterations: 7h 24m 27s. [2026-03-25 20:12:49,876][__main__][INFO] - Starting iteration 396. [2026-03-25 20:12:49,881][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:12:49,881][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:12:56,610][__main__][INFO] - Number of regex retries in iteration 396: 0 [2026-03-25 20:12:56,611][__main__][INFO] - agents played in iteration 396 are Bob, Alice [2026-03-25 20:12:57,410][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:12:57,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:12:57,472][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:12:57,473][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:12:58,308][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:12:58,917][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:12:59,577][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:13:00,236][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:13:00,894][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:13:01,553][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:13:02,211][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:13:02,870][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:13:03,528][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:13:04,187][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:13:04,845][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:13:05,503][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:13:06,162][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:13:06,822][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:13:07,481][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:13:08,141][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:13:08,799][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:13:09,457][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:13:10,115][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:13:10,772][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:13:11,430][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:13:12,089][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:13:12,747][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:13:13,405][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:13:14,063][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:13:14,720][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:13:15,379][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:13:16,037][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:13:16,694][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:13:17,352][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:13:18,010][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:13:18,668][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:13:19,325][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:13:19,983][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:13:20,640][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:13:21,298][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:13:21,957][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:13:22,615][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:13:23,285][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:13:23,945][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:13:24,603][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:13:25,261][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:13:25,919][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:13:26,578][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:13:27,236][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:13:27,894][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:13:28,554][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:13:29,213][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:13:30,194][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:13:30,853][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:13:31,512][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:13:32,171][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:13:32,829][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:13:33,487][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:13:34,145][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:13:34,804][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:13:35,462][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:13:36,120][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:13:36,779][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:13:37,438][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:13:38,098][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:13:38,757][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:13:39,414][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:13:40,072][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:13:40,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:13:41,542][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:13:43,834][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:13:43,836][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:13:43,837][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:13:45,314][__main__][INFO] - Iteration 397 took 55s (12.14% Gen, 85.19% Train). Generation: 6s, Training: 47s. Estimated remaining time: 9h 16m 15s. Estimated total time: 15h 23m 55s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 23s, 500 more iterations: 7h 41m 57s. [2026-03-25 20:13:45,316][__main__][INFO] - Starting iteration 397. [2026-03-25 20:13:45,320][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:13:45,320][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:13:49,988][__main__][INFO] - Number of regex retries in iteration 397: 0 [2026-03-25 20:13:49,989][__main__][INFO] - agents played in iteration 397 are Bob, Alice [2026-03-25 20:13:50,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:13:50,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:13:50,539][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:13:50,540][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:13:51,237][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:13:51,850][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:13:52,508][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:13:53,168][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:13:53,827][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:13:54,486][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:13:55,145][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:13:55,805][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:13:56,466][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:13:57,125][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:13:57,783][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:13:58,443][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:13:59,102][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:13:59,761][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:14:00,421][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:14:01,080][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:14:02,679][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:14:03,336][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:14:03,995][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:14:04,654][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:14:05,312][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:14:05,970][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:14:06,629][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:14:07,287][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:14:07,946][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:14:08,603][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:14:09,262][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:14:09,920][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:14:10,578][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:14:11,236][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:14:11,894][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:14:12,552][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:14:13,211][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:14:13,872][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:14:14,530][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:14:15,190][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:14:15,849][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:14:16,507][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:14:17,166][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:14:17,824][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:14:18,483][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:14:19,143][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:14:19,801][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:14:20,458][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:14:21,119][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:14:21,778][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:14:22,437][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:14:23,095][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:14:24,094][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:14:24,751][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:14:25,410][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:14:26,068][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:14:26,728][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:14:27,386][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:14:28,045][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:14:28,704][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:14:29,363][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:14:30,021][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:14:30,679][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:14:31,339][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:14:31,998][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:14:32,656][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:14:33,315][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:14:33,973][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:14:34,633][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:14:35,428][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 20:14:36,734][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:14:36,736][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:14:36,737][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:14:38,184][__main__][INFO] - Iteration 398 took 52s (8.83% Gen, 88.43% Train). Generation: 4s, Training: 46s. Estimated remaining time: 8h 32m 32s. Estimated total time: 14h 41m 5s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 6s, 500 more iterations: 7h 20m 32s. [2026-03-25 20:14:38,187][__main__][INFO] - Starting iteration 398. [2026-03-25 20:14:38,191][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:14:38,192][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:14:43,254][__main__][INFO] - Number of regex retries in iteration 398: 0 [2026-03-25 20:14:43,255][__main__][INFO] - agents played in iteration 398 are Bob, Alice [2026-03-25 20:14:43,846][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:14:43,907][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:14:43,908][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:14:43,908][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:14:44,773][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:14:45,381][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:14:46,041][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:14:46,700][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:14:47,358][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:14:48,017][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:14:48,676][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:14:49,336][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:14:49,996][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:14:50,655][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:14:51,313][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:14:51,974][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:14:52,632][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:14:53,291][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:14:53,951][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:14:54,611][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:14:55,270][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:14:55,929][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:14:56,588][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:14:57,247][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:14:57,906][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:14:58,566][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:14:59,226][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:14:59,885][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:15:00,542][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:15:01,202][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:15:01,860][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:15:02,521][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:15:03,180][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:15:03,839][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:15:04,498][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:15:05,156][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:15:05,815][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:15:06,474][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:15:07,132][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:15:07,926][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:15:08,585][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:15:09,244][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:15:09,904][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:15:10,564][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:15:11,223][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:15:11,883][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:15:12,542][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:15:13,201][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:15:13,860][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:15:14,520][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:15:15,178][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:15:15,838][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:15:16,822][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:15:17,482][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:15:18,140][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:15:18,799][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:15:19,458][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:15:20,116][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:15:20,774][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:15:21,432][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:15:22,091][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:15:22,751][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:15:23,410][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:15:24,069][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:15:24,730][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:15:25,389][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:15:26,047][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:15:26,707][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:15:27,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:15:28,134][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:15:29,569][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:15:29,571][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:15:29,573][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:15:31,017][__main__][INFO] - Iteration 399 took 52s (9.58% Gen, 87.68% Train). Generation: 5s, Training: 46s. Estimated remaining time: 8h 31m 2s. Estimated total time: 14h 40m 28s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 2s, 500 more iterations: 7h 20m 14s. [2026-03-25 20:15:31,019][__main__][INFO] - Starting iteration 399. [2026-03-25 20:15:31,023][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:15:31,023][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:15:37,046][__main__][INFO] - Number of regex retries in iteration 399: 0 [2026-03-25 20:15:37,047][__main__][INFO] - agents played in iteration 399 are Bob, Alice [2026-03-25 20:15:37,939][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:15:38,001][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:15:38,002][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:15:38,002][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:15:38,732][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:15:39,350][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:15:40,006][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:15:40,665][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:15:41,323][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:15:41,983][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:15:42,642][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:15:43,300][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:15:43,957][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:15:44,615][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:15:45,273][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:15:45,932][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:15:46,591][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:15:47,250][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:15:47,909][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:15:48,568][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:15:49,226][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:15:49,884][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:15:50,543][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:15:51,201][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:15:51,858][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:15:52,515][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:15:53,173][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:15:53,831][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:15:54,505][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:15:58,479][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:15:59,138][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:15:59,798][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:16:00,456][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:16:01,113][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:16:01,773][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:16:02,432][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:16:03,091][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:16:03,750][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:16:04,408][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:16:05,066][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:16:05,725][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:16:06,385][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:16:07,043][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:16:07,701][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:16:08,360][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:16:09,019][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:16:09,679][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:16:10,339][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:16:10,998][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:16:11,656][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:16:12,314][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:16:12,973][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:16:13,962][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:16:14,622][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:16:15,282][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:16:15,941][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:16:16,600][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:16:17,258][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:16:17,916][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:16:18,575][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:16:19,239][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:16:19,901][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:16:20,558][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:16:21,218][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:16:21,877][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:16:22,536][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:16:23,196][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:16:23,855][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:16:24,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:16:25,307][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:46 [2026-03-25 20:16:26,299][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:16:26,301][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:16:26,302][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:16:27,646][__main__][INFO] - Iteration 400 took 56s (10.64% Gen, 86.99% Train). Generation: 6s, Training: 49s. Estimated remaining time: 9h 33m 22s. Estimated total time: 15h 43m 44s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 22s, 500 more iterations: 7h 51m 52s. [2026-03-25 20:16:27,648][__main__][INFO] - Starting iteration 400. [2026-03-25 20:16:27,652][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:16:27,652][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:16:32,678][__main__][INFO] - Number of regex retries in iteration 400: 0 [2026-03-25 20:16:32,680][__main__][INFO] - agents played in iteration 400 are Bob, Alice [2026-03-25 20:16:33,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:16:33,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:16:33,244][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:16:33,245][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:16:33,917][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:16:34,532][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:16:35,192][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:16:35,853][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:16:36,513][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:16:37,172][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:16:37,831][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:16:38,491][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:16:39,150][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:16:39,810][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:16:40,469][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:16:41,127][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:16:41,788][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:16:42,447][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:16:43,107][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:16:43,767][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:16:44,428][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:16:45,089][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:16:45,748][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:16:46,408][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:16:47,067][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:16:47,728][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:16:48,388][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:16:49,048][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:16:49,710][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:16:50,370][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:16:51,030][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:16:51,689][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:16:52,355][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:16:53,017][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:16:53,677][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:16:54,337][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:16:54,997][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:16:55,657][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:16:56,319][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:16:56,979][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:16:57,640][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:16:58,302][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:16:58,963][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:16:59,624][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:17:00,285][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:17:00,945][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:17:01,606][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:17:02,267][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:17:02,928][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:17:03,588][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:17:04,247][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:17:04,906][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:17:05,893][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:17:06,553][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:17:07,213][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:17:07,873][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:17:08,532][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:17:09,190][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:17:09,849][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:17:10,508][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:17:11,166][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:17:11,826][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:17:12,484][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:17:13,142][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:17:13,800][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:17:14,459][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:17:15,117][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:17:15,777][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:17:16,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:17:17,263][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:17:18,850][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:17:18,853][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:17:18,854][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:17:22,450][__main__][INFO] - Iteration 401 took 54s (9.17% Gen, 84.26% Train). Generation: 5s, Training: 46s. Estimated remaining time: 9h 2m 2s. Estimated total time: 15h 13m 19s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 19s, 500 more iterations: 7h 36m 39s. [2026-03-25 20:17:22,452][__main__][INFO] - Starting iteration 401. [2026-03-25 20:17:22,455][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:17:22,456][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:17:36,580][__main__][INFO] - Number of regex retries in iteration 401: 0 [2026-03-25 20:17:36,581][__main__][INFO] - agents played in iteration 401 are Bob, Alice [2026-03-25 20:17:37,179][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:17:37,242][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:17:37,242][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:17:37,243][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:17:38,127][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:17:38,801][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:17:39,455][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:17:40,112][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:17:40,772][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:17:41,431][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:17:42,088][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:17:42,747][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:17:43,405][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:17:44,063][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:17:44,720][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:17:45,380][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:17:46,038][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:17:46,698][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:17:47,356][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:17:48,016][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:17:48,675][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:17:49,333][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:17:49,992][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:17:50,652][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:17:51,311][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:17:51,970][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:17:52,627][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:17:53,286][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:17:53,945][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:17:54,603][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:17:55,261][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:17:55,919][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:17:56,578][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:17:57,236][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:17:57,894][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:17:58,552][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:17:59,211][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:17:59,869][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:18:00,527][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:18:01,185][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:18:01,844][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:18:02,502][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:18:03,160][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:18:03,817][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:18:04,475][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:18:05,134][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:18:05,792][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:18:06,449][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:18:07,108][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:18:07,765][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:18:08,425][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:18:09,084][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:18:10,075][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:18:10,733][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:18:11,392][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:18:12,051][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:18:12,709][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:18:13,376][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:18:14,035][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:18:14,695][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:18:15,353][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:18:16,011][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:18:16,669][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:18:17,328][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:18:17,986][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:18:18,645][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:18:19,303][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:18:19,961][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:18:20,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:18:21,407][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:18:23,118][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:18:23,121][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:18:23,122][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:18:24,648][__main__][INFO] - Iteration 402 took 1m 2s (22.71% Gen, 74.83% Train). Generation: 14s, Training: 46s. Estimated remaining time: 11h 4m 15s. Estimated total time: 17h 16m 35s. Time estimates for 10 more iterations: 10m 21s, 100 more iterations: 1h 43m 39s, 500 more iterations: 8h 38m 17s. [2026-03-25 20:18:24,652][__main__][INFO] - Starting iteration 402. [2026-03-25 20:18:24,657][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:18:24,657][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:18:30,858][__main__][INFO] - Number of regex retries in iteration 402: 0 [2026-03-25 20:18:30,861][__main__][INFO] - agents played in iteration 402 are Bob, Alice [2026-03-25 20:18:31,755][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:18:31,816][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:18:31,817][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:18:31,817][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:18:32,480][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:18:33,104][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:18:33,763][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:18:34,424][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:18:35,083][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:18:35,743][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:18:36,401][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:18:37,061][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:18:37,720][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:18:38,381][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:18:39,039][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:18:39,698][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:18:40,357][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:18:41,018][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:18:41,677][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:18:42,335][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:18:42,995][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:18:43,654][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:18:44,312][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:18:44,970][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:18:45,630][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:18:46,288][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:18:46,947][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:18:47,606][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:18:48,265][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:18:48,924][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:18:49,584][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:18:50,244][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:18:50,903][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:18:51,561][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:18:52,219][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:18:52,878][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:18:53,538][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:18:54,198][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:18:54,856][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:18:55,515][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:18:56,175][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:18:56,834][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:18:57,492][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:18:58,152][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:18:58,813][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:18:59,471][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:19:00,130][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:19:00,790][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:19:01,449][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:19:02,107][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:19:02,766][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:19:03,425][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:19:04,408][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:19:05,067][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:19:05,726][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:19:06,384][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:19:07,042][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:19:07,701][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:19:08,359][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:19:09,019][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:19:09,679][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:19:10,337][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:19:10,997][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:19:11,656][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:19:12,314][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:19:12,973][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:19:13,632][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:19:14,291][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:19:14,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:19:15,746][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:19:17,122][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:19:17,125][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:19:17,126][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:19:18,549][__main__][INFO] - Iteration 403 took 53s (11.51% Gen, 85.84% Train). Generation: 6s, Training: 46s. Estimated remaining time: 8h 45m 1s. Estimated total time: 14h 58m 14s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 49s, 500 more iterations: 7h 29m 7s. [2026-03-25 20:19:18,552][__main__][INFO] - Starting iteration 403. [2026-03-25 20:19:18,559][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:19:18,560][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:19:23,562][__main__][INFO] - Number of regex retries in iteration 403: 0 [2026-03-25 20:19:23,563][__main__][INFO] - agents played in iteration 403 are Bob, Alice [2026-03-25 20:19:24,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:19:24,110][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:19:24,111][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:19:24,111][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:19:24,972][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:19:25,592][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:19:26,251][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:19:26,910][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:19:27,567][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:19:28,226][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:19:28,884][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:19:29,543][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:19:30,203][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:19:30,860][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:19:31,517][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:19:32,175][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:19:32,832][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:19:33,491][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:19:34,148][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:19:34,807][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:19:35,465][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:19:36,122][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:19:36,780][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:19:37,437][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:19:38,095][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:19:38,753][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:19:39,412][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:19:40,071][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:19:40,728][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:19:41,386][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:19:42,044][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:19:42,702][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:19:43,361][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:19:44,020][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:19:44,677][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:19:45,335][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:19:45,993][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:19:46,650][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:19:47,309][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:19:47,968][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:19:48,626][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:19:49,284][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:19:49,943][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:19:50,602][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:19:51,261][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:19:51,919][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:19:52,577][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:19:53,235][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:19:53,893][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:19:54,552][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:19:55,210][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:19:55,867][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:19:56,855][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:19:57,515][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:19:58,175][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:19:58,834][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:19:59,493][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:20:00,152][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:20:00,810][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:20:01,467][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:20:02,125][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:20:02,784][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:20:03,442][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:20:04,100][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:20:04,760][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:20:05,418][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:20:06,076][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:20:06,734][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:20:07,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:20:08,132][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:20:09,625][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:20:09,628][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:20:09,629][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:20:11,170][__main__][INFO] - Iteration 404 took 52s (9.51% Gen, 87.56% Train). Generation: 5s, Training: 46s. Estimated remaining time: 8h 22m 47s. Estimated total time: 14h 36m 53s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 41s, 500 more iterations: 7h 18m 26s. [2026-03-25 20:20:11,172][__main__][INFO] - Starting iteration 404. [2026-03-25 20:20:11,176][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:20:11,177][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:20:16,389][__main__][INFO] - Number of regex retries in iteration 404: 0 [2026-03-25 20:20:16,390][__main__][INFO] - agents played in iteration 404 are Bob, Alice [2026-03-25 20:20:16,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:20:17,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:20:17,045][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:20:17,046][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:20:17,747][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:20:18,356][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:20:19,017][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:20:19,676][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:20:20,336][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:20:20,995][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:20:21,655][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:20:22,314][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:20:22,973][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:20:23,632][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:20:24,292][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:20:24,951][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:20:25,610][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:20:26,269][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:20:26,928][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:20:27,589][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:20:28,248][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:20:28,907][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:20:29,567][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:20:30,226][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:20:30,885][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:20:31,544][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:20:32,203][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:20:32,862][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:20:33,522][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:20:34,181][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:20:34,840][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:20:35,500][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:20:36,158][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:20:36,817][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:20:37,476][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:20:38,135][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:20:38,795][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:20:39,453][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:20:40,112][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:20:40,771][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:20:41,431][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:20:42,090][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:20:42,749][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:20:43,408][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:20:44,067][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:20:44,726][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:20:45,385][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:20:46,044][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:20:46,703][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:20:47,362][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:20:48,021][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:20:48,680][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:20:49,671][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:20:50,331][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:20:50,989][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:20:51,647][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:20:52,309][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:20:52,967][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:20:53,625][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:20:54,283][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:20:54,940][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:20:55,598][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:20:56,256][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:20:56,915][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:20:57,574][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:20:58,233][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:20:58,892][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:20:59,553][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:21:00,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:21:00,990][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:21:02,664][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:21:02,666][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:21:02,668][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:21:04,120][__main__][INFO] - Iteration 405 took 52s (9.85% Gen, 87.41% Train). Generation: 5s, Training: 46s. Estimated remaining time: 8h 27m 27s. Estimated total time: 14h 42m 25s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 14s, 500 more iterations: 7h 21m 12s. [2026-03-25 20:21:04,123][__main__][INFO] - Starting iteration 405. [2026-03-25 20:21:04,127][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:21:04,127][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:21:19,519][__main__][INFO] - Number of regex retries in iteration 405: 0 [2026-03-25 20:21:19,520][__main__][INFO] - agents played in iteration 405 are Bob, Alice [2026-03-25 20:21:20,054][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:21:20,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:21:20,118][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:21:20,118][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:21:20,803][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:21:21,470][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:21:22,129][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:21:22,786][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:21:23,445][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:21:24,103][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:21:24,761][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:21:25,419][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:21:26,079][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:21:26,736][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:21:27,394][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:21:28,052][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:21:28,710][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:21:29,368][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:21:30,026][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:21:30,685][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:21:31,343][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:21:32,002][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:21:32,660][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:21:33,318][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:21:33,975][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:21:34,634][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:21:35,292][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:21:35,950][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:21:36,609][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:21:37,266][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:21:37,923][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:21:38,580][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:21:39,239][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:21:39,897][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:21:40,555][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:21:41,216][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:21:41,876][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:21:42,535][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:21:44,660][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:21:45,319][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:21:45,977][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:21:46,635][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:21:47,292][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:21:47,950][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:21:48,608][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:21:49,265][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:21:49,923][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:21:50,581][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:21:51,239][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:21:51,896][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:21:52,555][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:21:53,214][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:21:54,205][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:21:54,863][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:21:55,522][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:21:56,180][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:21:56,840][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:21:57,504][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:21:58,161][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:21:58,820][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:21:59,479][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:22:00,138][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:22:00,798][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:22:01,457][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:22:02,114][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:22:02,774][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:22:03,433][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:22:04,091][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:22:04,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:22:05,660][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 20:22:07,051][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:22:07,053][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:22:07,054][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:22:08,583][__main__][INFO] - Iteration 406 took 1m 4s (23.88% Gen, 73.74% Train). Generation: 15s, Training: 47s. Estimated remaining time: 11h 38m 14s. Estimated total time: 17h 54m 17s. Time estimates for 10 more iterations: 10m 44s, 100 more iterations: 1h 47m 25s, 500 more iterations: 8h 57m 8s. [2026-03-25 20:22:08,585][__main__][INFO] - Starting iteration 406. [2026-03-25 20:22:08,589][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:22:08,589][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:22:13,449][__main__][INFO] - Number of regex retries in iteration 406: 0 [2026-03-25 20:22:13,450][__main__][INFO] - agents played in iteration 406 are Bob, Alice [2026-03-25 20:22:13,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:22:13,996][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:22:13,997][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:22:13,998][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:22:14,761][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:22:15,395][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:22:16,054][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:22:16,717][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:22:17,376][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:22:18,036][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:22:18,695][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:22:19,355][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:22:20,014][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:22:20,673][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:22:21,333][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:22:21,992][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:22:22,652][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:22:23,311][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:22:23,971][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:22:24,629][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:22:25,288][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:22:25,948][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:22:26,608][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:22:27,266][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:22:27,926][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:22:28,586][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:22:29,245][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:22:29,904][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:22:30,567][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:22:31,227][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:22:31,886][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:22:32,544][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:22:33,204][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:22:33,863][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:22:34,522][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:22:35,180][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:22:35,839][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:22:36,505][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:22:37,165][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:22:37,825][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:22:38,485][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:22:39,145][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:22:39,804][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:22:40,465][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:22:41,124][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:22:41,782][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:22:42,441][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:22:43,101][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:22:43,761][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:22:44,421][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:22:45,081][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:22:45,740][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:22:46,731][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:22:47,391][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:22:48,050][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:22:48,710][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:22:49,369][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:22:50,028][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:22:50,687][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:22:51,345][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:22:52,006][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:22:52,665][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:22:53,324][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:22:53,985][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:22:54,644][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:22:55,303][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:22:55,961][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:22:56,619][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:22:57,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:22:58,054][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:22:59,489][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:22:59,492][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:22:59,493][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:23:00,996][__main__][INFO] - Iteration 407 took 52s (9.27% Gen, 87.85% Train). Generation: 4s, Training: 46s. Estimated remaining time: 8h 16m 33s. Estimated total time: 14h 33m 29s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 20s, 500 more iterations: 7h 16m 44s. [2026-03-25 20:23:00,998][__main__][INFO] - Starting iteration 407. [2026-03-25 20:23:01,002][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:23:01,003][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:23:10,435][__main__][INFO] - Number of regex retries in iteration 407: 0 [2026-03-25 20:23:10,436][__main__][INFO] - agents played in iteration 407 are Bob, Alice [2026-03-25 20:23:11,036][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:23:11,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:23:11,097][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:23:11,097][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:23:11,887][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:23:12,509][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:23:13,167][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:23:13,826][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:23:14,487][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:23:16,202][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:23:16,859][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:23:17,518][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:23:18,177][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:23:18,836][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:23:19,495][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:23:20,155][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:23:20,812][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:23:21,472][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:23:22,131][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:23:22,790][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:23:23,447][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:23:24,106][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:23:24,765][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:23:25,424][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:23:26,083][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:23:26,741][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:23:27,400][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:23:28,058][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:23:28,728][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:23:29,387][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:23:30,046][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:23:30,704][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:23:31,363][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:23:32,022][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:23:32,685][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:23:33,344][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:23:34,003][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:23:34,662][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:23:35,320][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:23:35,978][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:23:36,638][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:23:37,298][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:23:37,957][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:23:38,616][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:23:39,276][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:23:39,937][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:23:40,596][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:23:41,254][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:23:41,913][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:23:42,572][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:23:43,229][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:23:43,886][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:23:44,885][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:23:45,543][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:23:46,201][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:23:46,860][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:23:47,518][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:23:48,176][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:23:48,835][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:23:49,492][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:23:50,149][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:23:50,807][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:23:51,465][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:23:52,122][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:23:52,781][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:23:53,438][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:23:54,096][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:23:54,754][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:23:55,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:23:56,527][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 20:23:58,384][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:23:58,387][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:23:58,388][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:23:59,813][__main__][INFO] - Iteration 408 took 58s (16.04% Gen, 81.53% Train). Generation: 9s, Training: 47s. Estimated remaining time: 10h 2m 18s. Estimated total time: 16h 20m 12s. Time estimates for 10 more iterations: 9m 48s, 100 more iterations: 1h 38m 1s, 500 more iterations: 8h 10m 6s. [2026-03-25 20:23:59,817][__main__][INFO] - Starting iteration 408. [2026-03-25 20:23:59,821][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:23:59,821][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:24:05,964][__main__][INFO] - Number of regex retries in iteration 408: 0 [2026-03-25 20:24:05,965][__main__][INFO] - agents played in iteration 408 are Bob, Alice [2026-03-25 20:24:06,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:24:06,866][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:24:06,866][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:24:06,867][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:24:07,543][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:24:08,152][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:24:08,812][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:24:09,471][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:24:10,129][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:24:10,787][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:24:11,446][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:24:12,107][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:24:12,764][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:24:13,422][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:24:14,080][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:24:14,737][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:24:15,395][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:24:17,671][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:24:18,333][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:24:18,990][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:24:19,648][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:24:20,306][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:24:20,964][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:24:21,623][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:24:22,280][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:24:22,937][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:24:23,596][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:24:24,253][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:24:24,911][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:24:25,568][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:24:26,227][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:24:26,884][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:24:27,542][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:24:28,200][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:24:28,858][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:24:29,517][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:24:30,175][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:24:30,832][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:24:31,491][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:24:32,149][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:24:32,807][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:24:33,465][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:24:34,123][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:24:34,781][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:24:35,441][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:24:36,098][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:24:36,756][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:24:37,415][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:24:38,072][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:24:38,730][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:24:39,389][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:24:40,047][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:24:41,040][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:24:41,698][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:24:42,356][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:24:43,014][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:24:43,672][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:24:44,331][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:24:44,989][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:24:45,649][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:24:46,307][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:24:46,965][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:24:47,623][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:24:48,281][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:24:48,939][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:24:49,597][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:24:50,256][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:24:50,913][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:24:51,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:24:52,409][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 20:24:54,294][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:24:54,302][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:24:54,304][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:24:55,775][__main__][INFO] - Iteration 409 took 55s (10.98% Gen, 86.39% Train). Generation: 6s, Training: 48s. Estimated remaining time: 9h 13m 45s. Estimated total time: 15h 32m 36s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 15s, 500 more iterations: 7h 46m 18s. [2026-03-25 20:24:55,778][__main__][INFO] - Starting iteration 409. [2026-03-25 20:24:55,782][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:24:55,782][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:25:01,829][__main__][INFO] - Number of regex retries in iteration 409: 0 [2026-03-25 20:25:01,830][__main__][INFO] - agents played in iteration 409 are Bob, Alice [2026-03-25 20:25:02,340][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:25:02,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:25:02,402][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:25:02,402][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:25:03,234][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:25:03,859][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:25:04,517][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:25:05,176][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:25:05,834][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:25:06,493][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:25:07,152][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:25:07,812][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:25:08,471][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:25:09,128][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:25:09,793][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:25:10,450][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:25:11,108][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:25:11,768][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:25:12,429][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:25:13,088][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:25:13,748][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:25:14,406][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:25:15,065][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:25:15,723][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:25:16,380][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:25:17,038][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:25:17,697][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:25:18,355][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:25:19,012][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:25:19,671][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:25:20,329][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:25:20,987][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:25:21,646][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:25:22,304][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:25:22,964][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:25:23,622][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:25:24,281][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:25:24,940][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:25:25,598][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:25:26,257][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:25:26,916][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:25:27,574][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:25:28,233][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:25:28,893][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:25:29,552][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:25:30,211][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:25:30,870][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:25:31,530][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:25:32,189][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:25:32,847][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:25:33,507][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:25:34,166][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:25:35,174][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:25:35,834][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:25:36,492][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:25:37,151][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:25:37,810][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:25:38,468][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:25:39,127][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:25:39,785][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:25:40,444][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:25:41,101][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:25:41,761][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:25:42,421][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:25:43,080][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:25:43,739][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:25:44,398][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:25:45,058][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:25:45,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:25:46,526][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:25:47,896][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:25:47,899][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:25:47,900][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:25:49,425][__main__][INFO] - Iteration 410 took 53s (11.27% Gen, 85.88% Train). Generation: 6s, Training: 46s. Estimated remaining time: 8h 34m 21s. Estimated total time: 14h 54m 5s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 24s, 500 more iterations: 7h 27m 2s. [2026-03-25 20:25:49,427][__main__][INFO] - Starting iteration 410. [2026-03-25 20:25:49,431][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:25:49,432][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:25:54,436][__main__][INFO] - Number of regex retries in iteration 410: 0 [2026-03-25 20:25:54,437][__main__][INFO] - agents played in iteration 410 are Bob, Alice [2026-03-25 20:25:55,036][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:25:55,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:25:55,098][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:25:55,099][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:25:55,847][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:25:56,454][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:25:57,113][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:25:57,773][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:25:58,432][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:25:59,091][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:25:59,750][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:26:00,409][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:26:01,067][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:26:01,725][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:26:02,384][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:26:03,041][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:26:03,700][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:26:04,359][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:26:05,017][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:26:05,681][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:26:06,343][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:26:07,005][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:26:07,663][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:26:08,321][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:26:08,980][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:26:09,639][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:26:10,298][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:26:10,957][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:26:11,617][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:26:12,275][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:26:12,933][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:26:13,592][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:26:14,252][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:26:14,910][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:26:15,569][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:26:16,226][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:26:16,887][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:26:17,546][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:26:18,205][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:26:18,865][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:26:19,525][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:26:20,184][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:26:20,843][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:26:21,503][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:26:22,162][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:26:22,821][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:26:23,482][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:26:24,142][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:26:24,802][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:26:25,461][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:26:26,119][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:26:26,777][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:26:27,765][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:26:28,424][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:26:29,084][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:26:29,741][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:26:30,399][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:26:31,059][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:26:31,719][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:26:32,379][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:26:33,041][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:26:33,702][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:26:34,362][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:26:35,021][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:26:35,680][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:26:36,339][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:26:36,999][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:26:37,660][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:26:38,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:26:39,093][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:26:40,357][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:26:40,360][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:26:40,361][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:26:41,853][__main__][INFO] - Iteration 411 took 52s (9.55% Gen, 87.60% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 13m 7s. Estimated total time: 14h 33m 43s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 22s, 500 more iterations: 7h 16m 51s. [2026-03-25 20:26:41,855][__main__][INFO] - Starting iteration 411. [2026-03-25 20:26:41,859][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:26:41,859][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:26:47,248][__main__][INFO] - Number of regex retries in iteration 411: 0 [2026-03-25 20:26:47,249][__main__][INFO] - agents played in iteration 411 are Bob, Alice [2026-03-25 20:26:47,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:26:47,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:26:47,840][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:26:47,840][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:26:48,503][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:26:49,114][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:26:49,775][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:26:50,436][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:26:51,096][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:26:51,756][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:26:52,417][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:26:53,076][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:26:53,738][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:26:54,398][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:26:55,059][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:26:55,719][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:26:56,378][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:26:57,038][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:26:57,697][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:26:58,358][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:26:59,018][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:26:59,678][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:27:00,338][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:27:00,998][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:27:01,656][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:27:02,314][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:27:02,974][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:27:03,634][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:27:04,295][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:27:04,954][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:27:05,613][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:27:06,273][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:27:06,933][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:27:07,594][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:27:08,253][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:27:08,912][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:27:09,571][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:27:10,230][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:27:10,889][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:27:11,548][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:27:12,207][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:27:12,869][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:27:13,529][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:27:14,189][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:27:14,848][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:27:15,506][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:27:16,166][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:27:16,825][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:27:17,483][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:27:18,142][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:27:18,801][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:27:19,462][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:27:20,459][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:27:21,119][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:27:21,778][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:27:22,436][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:27:23,094][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:27:23,753][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:27:24,413][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:27:25,071][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:27:25,731][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:27:26,389][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:27:27,049][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:27:27,707][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:27:28,367][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:27:29,028][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:27:29,688][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:27:30,347][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:27:31,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:27:31,851][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:27:33,743][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:27:33,746][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:27:43,502][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:27:46,750][__main__][INFO] - Iteration 412 took 1m 4s (8.31% Gen, 86.69% Train). Generation: 5s, Training: 56s. Estimated remaining time: 11h 39m 51s. Estimated total time: 18h 1m 32s. Time estimates for 10 more iterations: 10m 48s, 100 more iterations: 1h 48m 9s, 500 more iterations: 9h 0m 46s. [2026-03-25 20:27:46,753][__main__][INFO] - Starting iteration 412. [2026-03-25 20:27:46,768][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:27:46,768][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:27:52,381][__main__][INFO] - Number of regex retries in iteration 412: 0 [2026-03-25 20:27:52,384][__main__][INFO] - agents played in iteration 412 are Bob, Alice [2026-03-25 20:27:53,214][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:27:53,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:27:53,277][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:27:53,278][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:27:54,105][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:27:54,727][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:27:55,376][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:27:56,035][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:27:56,695][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:27:57,354][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:27:58,014][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:27:58,674][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:27:59,334][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:27:59,994][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:28:00,652][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:28:01,311][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:28:01,970][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:28:02,631][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:28:03,290][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:28:03,950][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:28:04,608][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:28:05,267][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:28:05,926][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:28:06,586][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:28:07,245][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:28:07,903][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:28:08,562][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:28:09,222][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:28:09,880][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:28:10,539][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:28:11,198][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:28:11,857][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:28:12,516][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:28:13,176][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:28:13,835][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:28:14,494][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:28:15,152][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:28:15,811][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:28:16,471][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:28:17,130][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:28:17,789][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:28:18,448][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:28:19,108][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:28:19,767][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:28:20,426][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:28:21,087][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:28:21,746][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:28:22,405][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:28:23,064][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:28:24,904][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:28:25,561][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:28:26,218][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:28:27,204][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:28:27,864][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:28:28,527][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:28:29,186][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:28:29,843][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:28:30,502][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:28:31,160][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:28:31,820][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:28:32,478][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:28:33,136][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:28:33,795][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:28:34,452][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:28:35,112][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:28:35,771][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:28:36,430][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:28:37,088][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:28:37,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:28:38,556][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 20:28:39,776][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:28:39,778][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:28:39,779][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:28:41,281][__main__][INFO] - Iteration 413 took 54s (10.30% Gen, 86.94% Train). Generation: 5s, Training: 47s. Estimated remaining time: 8h 45m 58s. Estimated total time: 15h 8m 34s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 51s, 500 more iterations: 7h 34m 17s. [2026-03-25 20:28:41,283][__main__][INFO] - Starting iteration 413. [2026-03-25 20:28:41,286][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:28:41,287][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:28:46,241][__main__][INFO] - Number of regex retries in iteration 413: 0 [2026-03-25 20:28:46,242][__main__][INFO] - agents played in iteration 413 are Bob, Alice [2026-03-25 20:28:46,755][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:28:46,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:28:46,816][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:28:46,816][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:28:47,497][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:28:48,104][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:28:48,766][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:28:49,425][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:28:50,085][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:28:50,744][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:28:51,403][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:28:52,063][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:28:52,721][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:28:53,381][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:28:54,042][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:28:54,702][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:28:55,361][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:28:56,020][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:28:56,681][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:28:57,340][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:28:57,999][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:28:58,659][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:28:59,319][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:28:59,979][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:29:00,638][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:29:01,296][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:29:01,955][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:29:02,616][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:29:03,276][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:29:03,935][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:29:04,595][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:29:05,256][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:29:05,916][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:29:06,576][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:29:07,235][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:29:07,895][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:29:08,555][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:29:09,216][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:29:09,875][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:29:10,534][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:29:11,195][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:29:11,854][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:29:12,513][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:29:13,173][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:29:13,833][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:29:14,492][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:29:15,151][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:29:15,812][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:29:16,472][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:29:17,131][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:29:17,791][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:29:18,450][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:29:19,871][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:29:20,527][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:29:21,191][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:29:21,851][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:29:22,511][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:29:23,169][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:29:23,828][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:29:24,488][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:29:25,146][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:29:25,805][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:29:26,463][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:29:27,122][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:29:27,782][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:29:28,441][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:29:29,100][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:29:29,760][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:29:30,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:29:31,195][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:29:32,473][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:29:32,475][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:29:32,477][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:29:34,058][__main__][INFO] - Iteration 414 took 52s (9.39% Gen, 87.61% Train). Generation: 4s, Training: 46s. Estimated remaining time: 8h 16m 4s. Estimated total time: 14h 39m 33s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 57s, 500 more iterations: 7h 19m 46s. [2026-03-25 20:29:34,060][__main__][INFO] - Starting iteration 414. [2026-03-25 20:29:34,064][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:29:34,064][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:29:39,646][__main__][INFO] - Number of regex retries in iteration 414: 0 [2026-03-25 20:29:39,647][__main__][INFO] - agents played in iteration 414 are Bob, Alice [2026-03-25 20:29:40,218][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:29:40,280][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:29:40,280][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:29:40,281][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:29:41,089][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:29:41,709][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:29:42,371][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:29:43,029][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:29:43,689][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:29:44,348][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:29:45,009][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:29:45,669][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:29:46,329][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:29:46,987][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:29:47,649][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:29:48,307][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:29:48,966][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:29:49,626][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:29:50,286][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:29:50,946][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:29:51,606][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:29:52,266][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:29:52,925][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:29:53,586][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:29:54,247][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:29:54,905][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:29:55,565][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:29:56,226][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:29:56,888][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:29:57,547][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:29:58,209][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:29:58,871][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:29:59,532][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:30:00,191][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:30:00,851][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:30:01,511][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:30:02,171][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:30:02,834][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:30:03,494][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:30:04,153][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:30:04,814][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:30:05,473][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:30:06,133][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:30:06,793][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:30:07,452][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:30:08,113][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:30:08,772][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:30:09,432][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:30:10,094][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:30:10,753][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:30:11,413][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:30:12,072][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:30:13,063][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:30:13,722][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:30:14,381][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:30:15,041][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:30:15,699][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:30:16,358][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:30:17,016][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:30:17,677][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:30:18,335][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:30:18,994][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:30:19,652][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:30:20,311][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:30:20,971][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:30:21,629][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:30:22,288][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:30:22,946][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:30:23,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:30:24,393][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:30:25,728][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:30:25,731][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:30:25,732][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:30:27,260][__main__][INFO] - Iteration 415 took 53s (10.50% Gen, 86.63% Train). Generation: 5s, Training: 46s. Estimated remaining time: 8h 22m 16s. Estimated total time: 14h 46m 38s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 39s, 500 more iterations: 7h 23m 19s. [2026-03-25 20:30:27,263][__main__][INFO] - Starting iteration 415. [2026-03-25 20:30:27,266][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:30:27,267][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:30:32,555][__main__][INFO] - Number of regex retries in iteration 415: 0 [2026-03-25 20:30:32,556][__main__][INFO] - agents played in iteration 415 are Bob, Alice [2026-03-25 20:30:33,486][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:30:33,547][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:30:33,548][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:30:33,548][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:30:34,211][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:30:34,987][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:30:35,647][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:30:36,308][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:30:36,967][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:30:37,628][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:30:38,286][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:30:38,947][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:30:39,606][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:30:40,265][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:30:40,926][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:30:41,585][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:30:42,245][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:30:42,905][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:30:43,564][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:30:44,223][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:30:44,882][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:30:45,541][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:30:46,202][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:30:46,860][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:30:47,520][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:30:48,178][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:30:48,836][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:30:49,495][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:30:50,156][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:30:50,815][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:30:51,474][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:30:52,133][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:30:52,793][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:30:53,452][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:30:54,111][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:30:54,770][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:30:55,430][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:30:56,090][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:30:56,749][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:30:57,408][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:30:58,067][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:30:58,727][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:30:59,385][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:31:00,044][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:31:00,703][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:31:01,363][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:31:02,022][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:31:02,681][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:31:03,340][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:31:03,999][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:31:04,658][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:31:05,317][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:31:06,318][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:31:06,978][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:31:07,637][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:31:08,296][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:31:08,955][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:31:09,614][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:31:10,272][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:31:10,931][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:31:11,591][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:31:12,250][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:31:12,908][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:31:13,566][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:31:14,224][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:31:14,883][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:31:15,541][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:31:16,200][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:31:16,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:31:17,717][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:31:19,051][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:31:19,054][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:31:19,055][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:31:20,579][__main__][INFO] - Iteration 416 took 53s (9.92% Gen, 87.22% Train). Generation: 5s, Training: 46s. Estimated remaining time: 8h 23m 18s. Estimated total time: 14h 48m 33s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 51s, 500 more iterations: 7h 24m 16s. [2026-03-25 20:31:20,581][__main__][INFO] - Starting iteration 416. [2026-03-25 20:31:20,586][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:31:20,587][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:31:26,248][__main__][INFO] - Number of regex retries in iteration 416: 0 [2026-03-25 20:31:26,250][__main__][INFO] - agents played in iteration 416 are Bob, Alice [2026-03-25 20:31:26,731][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:31:26,792][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:31:26,793][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:31:26,793][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:31:27,582][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:31:28,210][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:31:28,872][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:31:29,531][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:31:30,192][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:31:30,850][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:31:31,509][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:31:32,168][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:31:32,827][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:31:33,486][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:31:34,145][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:31:34,803][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:31:35,463][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:31:36,122][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:31:36,781][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:31:37,440][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:31:38,099][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:31:38,757][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:31:39,418][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:31:40,077][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:31:40,737][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:31:41,395][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:31:42,054][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:31:42,715][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:31:43,374][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:31:44,033][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:31:44,693][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:31:45,352][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:31:46,011][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:31:46,670][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:31:47,329][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:31:47,990][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:31:48,649][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:31:49,308][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:31:49,967][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:31:50,626][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:31:51,285][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:31:51,944][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:31:52,603][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:31:53,261][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:31:53,920][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:31:54,579][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:31:55,239][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:31:56,968][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:31:57,626][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:31:58,284][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:31:58,942][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:31:59,600][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:32:00,631][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:32:01,290][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:32:01,949][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:32:02,607][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:32:03,265][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:32:03,924][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:32:04,582][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:32:05,241][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:32:05,900][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:32:06,557][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:32:07,216][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:32:07,874][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:32:08,533][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:32:09,192][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:32:09,850][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:32:10,510][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:32:11,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:32:11,940][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 20:32:13,300][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:32:13,302][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:32:13,304][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:32:14,829][__main__][INFO] - Iteration 417 took 54s (10.44% Gen, 86.74% Train). Generation: 5s, Training: 47s. Estimated remaining time: 8h 37m 55s. Estimated total time: 15h 4m 5s. Time estimates for 10 more iterations: 9m 2s, 100 more iterations: 1h 30m 24s, 500 more iterations: 7h 32m 2s. [2026-03-25 20:32:14,831][__main__][INFO] - Starting iteration 417. [2026-03-25 20:32:14,835][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:32:14,836][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:32:20,663][__main__][INFO] - Number of regex retries in iteration 417: 0 [2026-03-25 20:32:20,664][__main__][INFO] - agents played in iteration 417 are Bob, Alice [2026-03-25 20:32:21,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:32:21,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:32:21,227][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:32:21,228][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:32:21,907][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:32:22,521][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:32:23,179][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:32:23,838][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:32:24,497][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:32:25,155][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:32:25,814][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:32:26,474][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:32:27,132][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:32:27,791][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:32:28,450][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:32:29,109][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:32:29,770][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:32:30,429][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:32:31,088][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:32:31,746][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:32:32,405][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:32:33,064][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:32:33,724][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:32:34,383][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:32:35,042][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:32:35,700][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:32:36,359][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:32:37,018][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:32:37,677][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:32:38,336][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:32:38,996][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:32:39,656][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:32:40,314][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:32:40,973][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:32:41,632][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:32:42,291][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:32:42,949][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:32:43,608][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:32:44,267][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:32:44,925][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:32:45,583][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:32:46,243][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:32:46,902][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:32:47,560][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:32:48,219][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:32:48,879][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:32:49,538][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:32:50,197][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:32:50,857][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:32:51,516][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:32:52,175][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:32:52,834][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:32:53,826][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:32:54,485][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:32:55,146][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:32:55,803][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:32:56,464][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:32:57,122][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:32:57,783][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:32:58,444][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:32:59,105][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:32:59,763][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:33:00,422][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:33:01,083][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:33:01,743][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:33:02,402][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:33:03,061][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:33:03,720][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:33:04,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:33:05,162][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:33:06,233][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:33:06,235][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:33:06,237][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:33:07,677][__main__][INFO] - Iteration 418 took 52s (11.03% Gen, 86.24% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 13m 41s. Estimated total time: 14h 40m 43s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 4s, 500 more iterations: 7h 20m 21s. [2026-03-25 20:33:07,679][__main__][INFO] - Starting iteration 418. [2026-03-25 20:33:07,682][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:33:07,683][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:33:12,735][__main__][INFO] - Number of regex retries in iteration 418: 0 [2026-03-25 20:33:12,736][__main__][INFO] - agents played in iteration 418 are Bob, Alice [2026-03-25 20:33:13,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:33:13,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:33:13,276][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:33:13,276][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:33:14,054][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:33:14,675][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:33:15,335][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:33:15,996][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:33:16,654][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:33:17,313][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:33:17,972][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:33:18,634][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:33:19,295][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:33:19,957][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:33:20,618][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:33:21,278][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:33:21,938][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:33:22,600][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:33:23,259][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:33:23,918][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:33:24,578][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:33:25,237][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:33:25,897][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:33:26,557][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:33:27,216][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:33:27,874][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:33:28,540][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:33:29,201][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:33:29,860][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:33:30,520][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:33:31,181][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:33:31,840][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:33:32,499][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:33:33,160][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:33:33,819][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:33:34,480][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:33:35,142][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:33:35,804][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:33:36,464][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:33:37,125][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:33:37,785][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:33:38,447][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:33:39,107][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:33:39,766][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:33:40,427][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:33:41,086][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:33:41,745][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:33:42,404][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:33:43,064][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:33:43,723][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:33:44,382][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:33:45,042][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:33:46,041][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:33:46,702][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:33:47,361][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:33:48,021][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:33:48,681][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:33:49,340][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:33:50,001][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:33:50,660][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:33:51,321][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:33:51,980][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:33:52,639][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:33:53,297][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:33:53,955][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:33:54,614][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:33:55,273][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:33:55,931][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:33:56,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:33:57,449][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:33:59,179][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:33:59,182][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:33:59,183][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:34:00,723][__main__][INFO] - Iteration 419 took 53s (9.53% Gen, 87.57% Train). Generation: 5s, Training: 46s. Estimated remaining time: 8h 16m 6s. Estimated total time: 14h 44m 1s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 24s, 500 more iterations: 7h 22m 0s. [2026-03-25 20:34:00,725][__main__][INFO] - Starting iteration 419. [2026-03-25 20:34:00,729][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:34:00,730][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:34:07,198][__main__][INFO] - Number of regex retries in iteration 419: 0 [2026-03-25 20:34:07,199][__main__][INFO] - agents played in iteration 419 are Bob, Alice [2026-03-25 20:34:07,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:34:07,747][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:34:07,747][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:34:07,748][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:34:08,426][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:34:09,042][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:34:09,702][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:34:10,362][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:34:11,020][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:34:11,681][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:34:12,342][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:34:13,000][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:34:13,660][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:34:14,318][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:34:14,977][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:34:15,636][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:34:16,293][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:34:16,952][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:34:17,609][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:34:18,268][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:34:18,926][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:34:19,584][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:34:20,242][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:34:20,900][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:34:21,559][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:34:22,218][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:34:22,875][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:34:23,534][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:34:24,193][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:34:24,850][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:34:25,508][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:34:26,166][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:34:26,824][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:34:27,483][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:34:28,142][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:34:28,802][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:34:29,459][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:34:30,119][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:34:30,778][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:34:31,436][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:34:32,094][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:34:32,753][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:34:33,411][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:34:34,069][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:34:34,726][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:34:35,385][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:34:36,043][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:34:36,701][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:34:37,360][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:34:38,018][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:34:38,678][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:34:39,336][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:34:40,325][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:34:40,985][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:34:41,644][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:34:42,303][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:34:42,963][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:34:43,621][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:34:44,279][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:34:44,939][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:34:45,596][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:34:46,255][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:34:46,915][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:34:47,573][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:34:48,232][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:34:48,892][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:34:49,550][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:34:50,208][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:34:50,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:34:51,761][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:34:53,160][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:34:53,162][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:34:53,164][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:34:54,714][__main__][INFO] - Iteration 420 took 53s (11.98% Gen, 85.14% Train). Generation: 6s, Training: 45s. Estimated remaining time: 8h 30m 56s. Estimated total time: 14h 59m 46s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 58s, 500 more iterations: 7h 29m 53s. [2026-03-25 20:34:54,716][__main__][INFO] - Starting iteration 420. [2026-03-25 20:34:54,720][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:34:54,721][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:35:00,316][__main__][INFO] - Number of regex retries in iteration 420: 0 [2026-03-25 20:35:00,317][__main__][INFO] - agents played in iteration 420 are Bob, Alice [2026-03-25 20:35:01,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:35:01,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:35:01,475][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:35:01,475][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:35:02,235][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:35:02,839][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:35:03,499][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:35:04,159][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:35:04,818][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:35:05,478][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:35:06,139][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:35:06,800][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:35:07,460][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:35:08,119][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:35:08,779][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:35:09,438][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:35:10,098][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:35:10,758][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:35:11,419][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:35:12,079][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:35:12,737][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:35:13,396][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:35:14,056][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:35:14,715][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:35:15,378][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:35:16,037][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:35:16,695][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:35:17,354][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:35:18,014][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:35:18,673][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:35:19,331][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:35:19,991][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:35:20,650][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:35:21,310][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:35:21,970][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:35:22,629][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:35:23,288][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:35:23,948][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:35:24,607][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:35:25,267][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:35:25,926][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:35:26,585][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:35:27,244][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:35:27,904][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:35:28,566][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:35:29,227][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:35:29,886][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:35:30,546][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:35:31,207][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:35:31,866][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:35:32,527][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:35:33,187][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:35:34,193][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:35:34,853][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:35:35,512][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:35:36,171][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:35:36,832][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:35:37,491][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:35:38,150][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:35:38,809][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:35:39,470][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:35:40,129][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:35:40,788][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:35:41,449][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:35:42,110][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:35:42,768][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:35:43,428][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:35:44,088][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:35:44,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:35:45,501][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:35:47,370][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:35:47,373][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:35:47,374][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:35:48,857][__main__][INFO] - Iteration 421 took 54s (10.34% Gen, 86.92% Train). Generation: 5s, Training: 47s. Estimated remaining time: 8h 32m 35s. Estimated total time: 15h 2m 19s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 13s, 500 more iterations: 7h 31m 9s. [2026-03-25 20:35:48,861][__main__][INFO] - Starting iteration 421. [2026-03-25 20:35:48,865][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:35:48,865][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:35:53,934][__main__][INFO] - Number of regex retries in iteration 421: 0 [2026-03-25 20:35:53,936][__main__][INFO] - agents played in iteration 421 are Bob, Alice [2026-03-25 20:35:54,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:35:54,490][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:35:54,491][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:35:54,491][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:35:55,189][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:35:55,805][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:35:56,468][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:35:57,126][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:35:57,785][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:35:58,447][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:35:59,107][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:35:59,769][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:36:00,429][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:36:01,089][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:36:01,748][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:36:02,409][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:36:03,069][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:36:03,728][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:36:04,387][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:36:05,046][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:36:05,706][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:36:06,366][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:36:07,026][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:36:07,714][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:36:08,380][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:36:09,039][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:36:09,698][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:36:10,360][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:36:11,017][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:36:11,677][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:36:12,337][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:36:12,999][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:36:13,658][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:36:14,318][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:36:14,978][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:36:15,637][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:36:16,297][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:36:16,956][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:36:17,615][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:36:18,275][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:36:18,934][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:36:19,593][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:36:20,253][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:36:20,913][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:36:21,571][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:36:22,230][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:36:22,889][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:36:23,550][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:36:24,209][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:36:24,870][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:36:25,529][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:36:26,188][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:36:28,445][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:36:29,104][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:36:29,763][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:36:30,422][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:36:31,081][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:36:31,739][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:36:32,398][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:36:33,056][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:36:33,714][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:36:34,373][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:36:35,031][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:36:35,690][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:36:36,350][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:36:37,008][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:36:37,666][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:36:38,324][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:36:38,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:36:39,863][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 20:36:41,253][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:36:41,255][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:36:41,257][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:36:42,666][__main__][INFO] - Iteration 422 took 53s (9.42% Gen, 87.95% Train). Generation: 5s, Training: 47s. Estimated remaining time: 8h 26m 6s. Estimated total time: 14h 56m 43s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 40s, 500 more iterations: 7h 28m 21s. [2026-03-25 20:36:42,669][__main__][INFO] - Starting iteration 422. [2026-03-25 20:36:42,674][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:36:42,674][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:36:47,499][__main__][INFO] - Number of regex retries in iteration 422: 0 [2026-03-25 20:36:47,500][__main__][INFO] - agents played in iteration 422 are Bob, Alice [2026-03-25 20:36:47,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:36:48,055][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:36:48,056][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:36:48,056][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:36:48,981][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:36:49,597][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:36:50,255][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:36:50,916][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:36:51,576][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:36:52,236][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:36:52,895][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:36:53,553][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:36:54,212][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:36:54,870][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:36:55,530][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:36:56,189][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:36:56,848][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:36:57,506][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:36:58,165][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:36:58,824][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:36:59,482][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:37:00,141][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:37:00,800][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:37:01,459][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:37:02,118][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:37:02,776][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:37:03,435][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:37:04,093][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:37:04,752][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:37:05,409][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:37:06,067][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:37:06,725][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:37:07,382][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:37:08,042][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:37:08,700][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:37:09,358][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:37:10,017][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:37:10,677][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:37:11,336][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:37:11,993][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:37:12,651][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:37:13,309][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:37:13,967][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:37:14,624][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:37:15,284][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:37:15,941][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:37:16,600][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:37:17,260][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:37:17,919][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:37:18,578][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:37:19,236][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:37:19,895][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:37:20,901][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:37:21,561][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:37:22,220][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:37:22,878][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:37:23,537][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:37:24,196][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:37:24,851][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:37:25,509][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:37:26,167][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:37:26,825][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:37:27,485][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:37:28,143][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:37:28,803][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:37:29,462][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:37:30,122][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:37:30,780][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:37:31,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:37:32,369][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:37:33,918][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:37:33,921][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:37:33,922][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:37:35,418][__main__][INFO] - Iteration 423 took 52s (9.15% Gen, 88.01% Train). Generation: 4s, Training: 46s. Estimated remaining time: 8h 7m 35s. Estimated total time: 14h 39m 5s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 54s, 500 more iterations: 7h 19m 32s. [2026-03-25 20:37:35,428][__main__][INFO] - Starting iteration 423. [2026-03-25 20:37:35,459][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:37:35,459][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:37:40,732][__main__][INFO] - Number of regex retries in iteration 423: 0 [2026-03-25 20:37:40,734][__main__][INFO] - agents played in iteration 423 are Bob, Alice [2026-03-25 20:37:41,603][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:37:41,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:37:41,666][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:37:41,666][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:37:42,358][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:37:42,980][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:37:43,641][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:37:44,299][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:37:44,958][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:37:45,617][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:37:46,274][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:37:47,022][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:37:47,681][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:37:48,339][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:37:48,996][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:37:49,654][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:37:50,317][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:37:50,974][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:37:51,633][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:37:52,292][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:37:52,950][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:37:53,609][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:37:54,267][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:37:54,925][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:37:55,582][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:37:56,241][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:37:56,899][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:37:57,557][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:37:58,216][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:37:58,875][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:37:59,532][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:38:00,190][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:38:00,848][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:38:01,506][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:38:02,165][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:38:02,823][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:38:03,485][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:38:04,143][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:38:04,801][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:38:05,460][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:38:06,118][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:38:06,776][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:38:07,435][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:38:08,092][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:38:08,751][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:38:09,410][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:38:10,068][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:38:10,726][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:38:11,385][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:38:12,044][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:38:12,702][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:38:13,360][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:38:14,346][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:38:15,006][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:38:15,666][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:38:16,324][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:38:16,982][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:38:17,640][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:38:18,299][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:38:18,959][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:38:19,619][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:38:20,277][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:38:20,935][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:38:21,593][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:38:22,251][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:38:22,908][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:38:23,566][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:38:24,223][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:38:24,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:38:25,727][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:38:27,108][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:38:27,112][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:38:27,113][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:38:28,614][__main__][INFO] - Iteration 424 took 53s (9.92% Gen, 87.25% Train). Generation: 5s, Training: 46s. Estimated remaining time: 8h 13m 34s. Estimated total time: 14h 45m 57s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 35s, 500 more iterations: 7h 22m 58s. [2026-03-25 20:38:28,619][__main__][INFO] - Starting iteration 424. [2026-03-25 20:38:28,623][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:38:28,623][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:38:34,048][__main__][INFO] - Number of regex retries in iteration 424: 0 [2026-03-25 20:38:34,050][__main__][INFO] - agents played in iteration 424 are Bob, Alice [2026-03-25 20:38:34,580][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:38:34,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:38:34,643][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:38:34,643][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:38:35,398][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:38:36,058][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:38:36,718][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:38:37,377][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:38:38,036][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:38:38,697][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:38:39,356][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:38:40,014][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:38:40,672][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:38:41,332][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:38:41,991][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:38:42,651][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:38:43,309][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:38:43,967][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:38:44,626][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:38:45,284][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:38:45,942][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:38:46,601][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:38:47,262][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:38:47,922][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:38:48,582][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:38:49,241][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:38:49,899][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:38:50,558][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:38:51,217][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:38:51,875][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:38:52,534][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:38:53,192][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:38:53,851][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:38:54,510][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:38:55,169][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:38:55,827][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:38:56,487][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:38:57,146][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:38:57,806][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:38:58,467][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:38:59,126][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:38:59,785][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:39:00,444][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:39:01,103][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:39:01,763][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:39:02,422][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:39:03,082][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:39:03,742][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:39:04,400][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:39:05,059][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:39:05,717][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:39:06,377][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:39:07,359][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:39:08,018][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:39:08,678][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:39:09,337][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:39:09,995][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:39:10,654][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:39:11,313][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:39:11,971][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:39:12,629][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:39:13,287][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:39:13,945][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:39:14,603][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:39:15,262][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:39:15,922][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:39:16,581][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:39:17,240][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:39:17,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:39:18,761][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:39:20,157][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:39:20,159][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:39:20,161][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:39:21,595][__main__][INFO] - Iteration 425 took 52s (10.24% Gen, 87.04% Train). Generation: 5s, Training: 46s. Estimated remaining time: 8h 9m 37s. Estimated total time: 14h 42m 53s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 17s, 500 more iterations: 7h 21m 26s. [2026-03-25 20:39:21,597][__main__][INFO] - Starting iteration 425. [2026-03-25 20:39:21,601][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:39:21,601][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:39:30,244][__main__][INFO] - Number of regex retries in iteration 425: 0 [2026-03-25 20:39:30,245][__main__][INFO] - agents played in iteration 425 are Bob, Alice [2026-03-25 20:39:30,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:39:30,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:39:30,889][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:39:30,889][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:39:31,687][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:39:32,299][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:39:32,958][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:39:33,615][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:39:34,276][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:39:34,934][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:39:35,591][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:39:36,249][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:39:36,907][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:39:37,566][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:39:38,223][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:39:38,884][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:39:39,543][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:39:40,202][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:39:40,859][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:39:41,518][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:39:42,176][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:39:42,835][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:39:43,493][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:39:44,151][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:39:44,809][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:39:45,467][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:39:46,125][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:39:46,783][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:39:47,441][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:39:48,100][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:39:48,759][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:39:49,418][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:39:50,076][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:39:50,734][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:39:51,393][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:39:52,050][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:39:52,708][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:39:53,368][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:39:54,027][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:39:54,684][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:39:55,342][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:39:56,002][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:39:56,660][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:39:57,318][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:39:57,976][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:39:58,635][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:39:59,294][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:39:59,952][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:40:00,610][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:40:01,271][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:40:01,930][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:40:02,589][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:40:03,577][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:40:04,236][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:40:04,895][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:40:05,554][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:40:06,211][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:40:06,870][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:40:07,529][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:40:08,187][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:40:08,845][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:40:09,504][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:40:10,161][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:40:10,819][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:40:11,479][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:40:12,139][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:40:12,798][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:40:13,460][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:40:14,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:40:14,940][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:40:16,388][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:40:16,392][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:40:16,394][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:40:17,909][__main__][INFO] - Iteration 426 took 56s (15.35% Gen, 81.96% Train). Generation: 8s, Training: 46s. Estimated remaining time: 9h 4m 17s. Estimated total time: 15h 38m 30s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 51s, 500 more iterations: 7h 49m 15s. [2026-03-25 20:40:17,912][__main__][INFO] - Starting iteration 426. [2026-03-25 20:40:17,917][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:40:17,917][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:40:23,293][__main__][INFO] - Number of regex retries in iteration 426: 0 [2026-03-25 20:40:23,294][__main__][INFO] - agents played in iteration 426 are Bob, Alice [2026-03-25 20:40:23,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:40:24,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:40:24,023][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:40:24,024][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:40:24,710][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:40:25,364][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:40:25,980][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:40:26,641][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:40:27,300][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:40:27,960][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:40:28,620][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:40:29,280][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:40:29,940][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:40:30,598][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:40:31,258][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:40:31,918][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:40:32,578][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:40:33,237][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:40:33,896][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:40:34,556][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:40:35,215][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:40:35,875][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:40:36,534][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:40:37,195][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:40:37,854][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:40:38,512][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:40:39,171][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:40:39,830][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:40:40,489][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:40:41,149][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:40:41,809][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:40:42,468][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:40:43,131][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:40:43,793][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:40:44,452][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:40:45,111][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:40:45,770][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:40:46,431][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:40:47,089][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:40:47,750][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:40:48,411][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:40:49,071][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:40:49,733][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:40:50,393][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:40:51,053][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:40:51,712][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:40:52,370][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:40:53,029][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:40:53,688][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:40:54,347][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:40:55,006][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:40:55,666][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:40:56,679][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:40:57,338][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:40:57,996][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:40:58,655][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:40:59,314][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:40:59,972][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:41:00,630][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:41:01,288][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:41:01,946][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:41:02,604][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:41:03,263][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:41:03,922][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:41:04,580][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:41:05,238][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:41:05,896][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:41:06,554][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:41:07,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:41:08,023][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:41:09,463][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:41:09,466][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:41:09,468][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:41:10,999][__main__][INFO] - Iteration 427 took 53s (10.13% Gen, 86.98% Train). Generation: 5s, Training: 46s. Estimated remaining time: 8h 9m 39s. Estimated total time: 14h 44m 44s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 28s, 500 more iterations: 7h 22m 22s. [2026-03-25 20:41:11,002][__main__][INFO] - Starting iteration 427. [2026-03-25 20:41:11,006][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:41:11,006][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:41:16,537][__main__][INFO] - Number of regex retries in iteration 427: 0 [2026-03-25 20:41:16,538][__main__][INFO] - agents played in iteration 427 are Bob, Alice [2026-03-25 20:41:17,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:41:17,091][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:41:17,092][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:41:17,092][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:41:17,798][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:41:18,408][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:41:19,068][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:41:19,726][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:41:20,384][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:41:21,042][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:41:21,701][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:41:22,359][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:41:23,018][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:41:23,678][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:41:24,336][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:41:24,994][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:41:25,656][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:41:26,315][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:41:26,975][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:41:27,635][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:41:28,297][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:41:28,958][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:41:29,617][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:41:30,278][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:41:30,937][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:41:31,598][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:41:32,257][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:41:32,916][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:41:33,577][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:41:34,236][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:41:34,896][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:41:35,560][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:41:36,219][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:41:36,878][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:41:37,537][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:41:38,196][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:41:38,856][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:41:39,517][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:41:40,176][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:41:40,835][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:41:41,494][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:41:42,153][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:41:42,812][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:41:43,472][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:41:44,131][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:41:44,790][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:41:45,448][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:41:46,108][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:41:46,767][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:41:47,426][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:41:48,085][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:41:48,746][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:41:49,729][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:41:50,388][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:41:51,047][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:41:51,706][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:41:52,364][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:41:53,023][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:41:53,681][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:41:54,339][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:41:54,997][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:41:55,655][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:41:56,313][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:41:56,971][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:41:57,629][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:41:58,287][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:41:58,946][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:41:59,604][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:42:00,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:42:01,130][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:42:02,514][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:42:02,516][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:42:02,517][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:42:04,017][__main__][INFO] - Iteration 428 took 53s (10.44% Gen, 86.73% Train). Generation: 5s, Training: 45s. Estimated remaining time: 8h 7m 34s. Estimated total time: 14h 43m 32s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 21s, 500 more iterations: 7h 21m 46s. [2026-03-25 20:42:04,019][__main__][INFO] - Starting iteration 428. [2026-03-25 20:42:04,023][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:42:04,024][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:42:10,211][__main__][INFO] - Number of regex retries in iteration 428: 0 [2026-03-25 20:42:10,212][__main__][INFO] - agents played in iteration 428 are Bob, Alice [2026-03-25 20:42:10,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:42:10,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:42:10,773][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:42:10,773][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:42:11,473][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:42:12,092][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:42:12,752][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:42:13,411][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:42:14,070][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:42:14,728][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:42:15,386][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:42:16,044][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:42:16,703][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:42:17,361][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:42:18,019][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:42:18,677][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:42:19,335][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:42:19,993][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:42:20,652][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:42:21,311][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:42:21,969][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:42:22,627][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:42:23,286][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:42:23,944][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:42:24,602][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:42:25,259][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:42:25,918][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:42:26,577][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:42:27,235][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:42:27,894][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:42:28,553][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:42:29,212][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:42:29,869][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:42:30,527][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:42:31,186][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:42:31,844][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:42:32,503][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:42:33,162][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:42:33,821][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:42:34,479][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:42:35,137][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:42:35,796][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:42:36,454][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:42:37,112][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:42:37,771][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:42:38,429][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:42:39,087][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:42:39,745][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:42:40,404][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:42:41,061][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:42:41,719][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:42:42,378][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:42:43,391][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:42:44,050][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:42:44,709][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:42:45,369][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:42:46,028][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:42:46,687][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:42:47,347][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:42:48,007][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:42:48,667][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:42:49,326][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:42:49,987][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:42:50,647][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:42:51,309][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:42:51,969][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:42:52,628][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:42:53,287][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:42:53,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:42:54,726][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:42:56,433][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:42:56,436][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:42:56,437][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:42:57,928][__main__][INFO] - Iteration 429 took 53s (11.48% Gen, 85.75% Train). Generation: 6s, Training: 46s. Estimated remaining time: 8h 21m 33s. Estimated total time: 14h 58m 26s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 50s, 500 more iterations: 7h 29m 13s. [2026-03-25 20:42:57,930][__main__][INFO] - Starting iteration 429. [2026-03-25 20:42:57,934][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:42:57,934][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:43:03,709][__main__][INFO] - Number of regex retries in iteration 429: 0 [2026-03-25 20:43:03,710][__main__][INFO] - agents played in iteration 429 are Bob, Alice [2026-03-25 20:43:04,304][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:43:04,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:43:04,366][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:43:04,366][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:43:05,122][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:43:05,738][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:43:06,398][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:43:07,058][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:43:07,719][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:43:08,378][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:43:09,037][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:43:09,696][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:43:10,356][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:43:11,015][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:43:11,675][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:43:12,337][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:43:12,996][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:43:13,655][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:43:14,314][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:43:14,973][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:43:15,633][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:43:16,294][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:43:16,954][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:43:17,614][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:43:18,274][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:43:18,933][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:43:19,592][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:43:20,252][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:43:20,911][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:43:21,570][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:43:22,229][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:43:22,887][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:43:23,547][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:43:24,207][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:43:24,867][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:43:25,525][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:43:26,186][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:43:26,846][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:43:27,505][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:43:28,165][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:43:28,825][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:43:29,486][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:43:30,145][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:43:30,805][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:43:31,467][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:43:32,127][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:43:32,788][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:43:33,449][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:43:34,110][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:43:34,770][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:43:35,432][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:43:36,091][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:43:37,078][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:43:37,739][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:43:38,397][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:43:39,059][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:43:39,718][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:43:40,377][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:43:41,037][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:43:41,697][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:43:42,357][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:43:43,014][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:43:43,674][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:43:44,331][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:43:44,989][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:43:45,648][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:43:46,305][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:43:46,964][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:43:47,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:43:48,609][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:43:49,935][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:43:49,937][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:43:49,938][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:43:51,493][__main__][INFO] - Iteration 430 took 53s (10.78% Gen, 86.31% Train). Generation: 5s, Training: 46s. Estimated remaining time: 8h 14m 55s. Estimated total time: 14h 52m 41s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 16s, 500 more iterations: 7h 26m 20s. [2026-03-25 20:43:51,496][__main__][INFO] - Starting iteration 430. [2026-03-25 20:43:51,500][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:43:51,501][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:43:57,210][__main__][INFO] - Number of regex retries in iteration 430: 0 [2026-03-25 20:43:57,211][__main__][INFO] - agents played in iteration 430 are Bob, Alice [2026-03-25 20:43:58,326][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:43:58,388][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:43:58,389][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:43:58,389][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:43:59,089][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:43:59,703][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:44:00,364][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:44:01,024][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:44:01,684][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:44:02,342][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:44:03,002][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:44:03,661][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:44:04,320][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:44:04,979][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:44:05,638][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:44:06,297][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:44:06,956][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:44:07,616][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:44:08,276][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:44:08,936][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:44:09,595][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:44:10,256][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:44:10,916][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:44:11,576][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:44:12,237][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:44:12,897][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:44:13,557][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:44:14,218][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:44:14,879][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:44:15,539][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:44:16,199][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:44:16,860][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:44:17,521][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:44:18,179][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:44:18,838][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:44:19,496][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:44:20,156][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:44:20,814][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:44:21,473][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:44:22,132][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:44:22,791][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:44:23,452][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:44:24,113][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:44:24,772][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:44:25,431][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:44:26,091][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:44:26,752][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:44:27,411][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:44:28,070][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:44:28,729][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:44:29,388][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:44:30,048][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:44:31,036][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:44:31,696][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:44:32,356][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:44:33,016][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:44:33,675][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:44:34,337][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:44:34,997][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:44:35,657][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:44:36,316][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:44:36,976][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:44:37,636][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:44:38,295][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:44:38,954][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:44:39,614][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:44:40,273][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:44:40,934][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:44:41,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:44:42,305][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:44:43,448][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:44:43,450][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:44:43,451][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:44:44,831][__main__][INFO] - Iteration 431 took 53s (10.71% Gen, 86.70% Train). Generation: 5s, Training: 46s. Estimated remaining time: 8h 10m 13s. Estimated total time: 14h 48m 53s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 53s, 500 more iterations: 7h 24m 26s. [2026-03-25 20:44:44,833][__main__][INFO] - Starting iteration 431. [2026-03-25 20:44:44,837][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:44:44,837][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:44:50,540][__main__][INFO] - Number of regex retries in iteration 431: 0 [2026-03-25 20:44:50,541][__main__][INFO] - agents played in iteration 431 are Bob, Alice [2026-03-25 20:44:51,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:44:51,106][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:44:51,106][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:44:51,107][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:44:51,918][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:44:52,614][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:44:53,227][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:44:53,887][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:44:54,547][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:44:55,206][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:44:55,866][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:44:56,526][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:44:57,184][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:44:57,845][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:44:58,504][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:44:59,164][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:44:59,823][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:45:00,482][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:45:01,142][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:45:01,802][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:45:02,460][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:45:03,119][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:45:03,779][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:45:04,438][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:45:05,097][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:45:05,757][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:45:06,416][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:45:07,074][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:45:07,733][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:45:08,391][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:45:09,050][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:45:09,709][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:45:10,369][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:45:11,028][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:45:11,687][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:45:12,346][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:45:13,004][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:45:13,663][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:45:14,322][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:45:14,981][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:45:15,639][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:45:16,298][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:45:16,959][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:45:17,618][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:45:18,278][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:45:18,937][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:45:19,595][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:45:20,254][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:45:20,918][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:45:21,578][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:45:22,236][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:45:22,895][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:45:23,910][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:45:24,570][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:45:25,226][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:45:25,885][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:45:26,543][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:45:27,202][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:45:27,861][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:45:28,520][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:45:29,179][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:45:29,837][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:45:30,496][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:45:31,156][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:45:31,816][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:45:32,475][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:45:33,133][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:45:33,792][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:45:34,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:45:35,241][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:45:37,825][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:45:37,827][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:45:37,828][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:45:39,137][__main__][INFO] - Iteration 432 took 54s (10.50% Gen, 87.08% Train). Generation: 5s, Training: 47s. Estimated remaining time: 8h 25m 28s. Estimated total time: 15h 5m 2s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 30s, 500 more iterations: 7h 32m 31s. [2026-03-25 20:45:39,140][__main__][INFO] - Starting iteration 432. [2026-03-25 20:45:39,143][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:45:39,143][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:45:43,894][__main__][INFO] - Number of regex retries in iteration 432: 0 [2026-03-25 20:45:43,895][__main__][INFO] - agents played in iteration 432 are Bob, Alice [2026-03-25 20:45:44,392][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:45:44,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:45:44,453][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:45:44,454][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:45:45,116][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:45:45,730][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:45:46,390][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:45:47,048][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:45:47,708][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:45:48,366][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:45:49,024][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:45:49,684][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:45:50,345][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:45:51,004][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:45:51,662][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:45:52,321][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:45:52,980][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:45:53,641][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:45:54,299][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:45:54,958][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:45:55,617][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:45:56,275][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:45:56,935][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:45:57,594][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:45:58,255][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:45:58,914][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:45:59,573][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:46:00,232][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:46:00,892][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:46:01,552][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:46:02,211][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:46:02,871][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:46:03,530][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:46:04,190][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:46:04,849][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:46:05,510][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:46:06,169][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:46:06,828][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:46:07,488][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:46:08,147][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:46:08,806][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:46:09,465][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:46:10,126][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:46:10,785][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:46:11,444][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:46:12,103][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:46:12,764][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:46:13,422][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:46:14,082][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:46:14,741][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:46:15,400][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:46:16,059][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:46:17,043][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:46:17,702][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:46:18,360][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:46:19,019][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:46:19,676][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:46:20,334][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:46:20,993][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:46:21,654][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:46:22,312][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:46:22,971][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:46:23,630][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:46:24,293][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:46:24,951][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:46:25,608][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:46:26,267][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:46:26,924][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:46:27,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:46:28,395][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:46:29,779][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:46:29,781][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:46:29,782][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:46:31,161][__main__][INFO] - Iteration 433 took 52s (9.13% Gen, 88.21% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 46m 33s. Estimated total time: 14h 26m 59s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 41s, 500 more iterations: 7h 13m 29s. [2026-03-25 20:46:31,163][__main__][INFO] - Starting iteration 433. [2026-03-25 20:46:31,167][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:46:31,167][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:46:35,973][__main__][INFO] - Number of regex retries in iteration 433: 0 [2026-03-25 20:46:35,974][__main__][INFO] - agents played in iteration 433 are Bob, Alice [2026-03-25 20:46:36,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:46:36,520][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:46:36,521][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:46:36,521][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:46:37,353][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:46:37,986][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:46:38,647][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:46:39,308][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:46:39,968][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:46:40,627][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:46:41,286][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:46:41,946][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:46:42,606][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:46:43,267][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:46:43,926][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:46:44,586][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:46:45,245][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:46:45,905][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:46:46,564][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:46:47,222][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:46:47,881][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:46:48,542][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:46:49,201][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:46:49,859][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:46:50,518][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:46:51,178][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:46:51,839][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:46:52,498][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:46:53,157][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:46:53,817][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:46:54,477][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:46:55,136][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:46:55,796][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:46:56,456][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:46:57,115][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:46:57,776][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:46:58,436][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:46:59,096][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:46:59,757][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:47:00,417][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:47:01,077][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:47:01,739][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:47:02,400][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:47:03,060][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:47:03,721][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:47:04,382][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:47:05,043][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:47:05,703][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:47:06,362][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:47:07,022][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:47:07,682][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:47:09,928][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:47:10,909][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:47:11,569][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:47:12,229][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:47:12,887][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:47:13,547][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:47:14,205][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:47:14,863][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:47:15,521][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:47:16,180][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:47:16,838][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:47:17,497][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:47:18,155][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:47:18,814][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:47:19,471][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:47:20,130][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:47:20,789][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:47:21,447][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:47:22,232][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 20:47:23,597][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:47:23,600][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:47:23,601][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:47:25,060][__main__][INFO] - Iteration 434 took 53s (8.92% Gen, 88.37% Train). Generation: 4s, Training: 47s. Estimated remaining time: 8h 16m 55s. Estimated total time: 14h 58m 15s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 49s, 500 more iterations: 7h 29m 7s. [2026-03-25 20:47:25,062][__main__][INFO] - Starting iteration 434. [2026-03-25 20:47:25,067][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:47:25,067][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:47:31,133][__main__][INFO] - Number of regex retries in iteration 434: 0 [2026-03-25 20:47:31,134][__main__][INFO] - agents played in iteration 434 are Bob, Alice [2026-03-25 20:47:31,652][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:47:31,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:47:31,716][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:47:31,716][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:47:32,488][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:47:33,106][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:47:33,766][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:47:34,423][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:47:35,080][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:47:35,740][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:47:36,397][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:47:37,056][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:47:37,715][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:47:38,373][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:47:39,031][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:47:39,690][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:47:40,347][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:47:41,006][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:47:41,665][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:47:42,323][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:47:42,980][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:47:43,638][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:47:44,297][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:47:44,955][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:47:45,612][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:47:46,271][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:47:46,928][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:47:47,586][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:47:48,244][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:47:48,902][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:47:49,561][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:47:50,219][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:47:50,877][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:47:51,535][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:47:52,193][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:47:52,852][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:47:53,514][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:47:54,172][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:47:54,830][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:47:55,490][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:47:56,147][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:47:56,805][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:47:57,462][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:47:58,121][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:47:58,780][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:47:59,439][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:48:00,097][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:48:00,757][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:48:01,415][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:48:02,073][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:48:02,732][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:48:03,390][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:48:04,377][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:48:05,036][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:48:05,694][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:48:06,354][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:48:07,013][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:48:07,670][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:48:09,251][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:48:09,910][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:48:10,568][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:48:11,227][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:48:11,885][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:48:12,544][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:48:13,202][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:48:13,860][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:48:14,518][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:48:15,176][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:48:15,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:48:16,622][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 20:48:18,072][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:48:18,074][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:48:18,076][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:48:19,707][__main__][INFO] - Iteration 435 took 54s (11.10% Gen, 85.91% Train). Generation: 6s, Training: 46s. Estimated remaining time: 8h 28m 28s. Estimated total time: 15h 10m 42s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 4s, 500 more iterations: 7h 35m 21s. [2026-03-25 20:48:19,709][__main__][INFO] - Starting iteration 435. [2026-03-25 20:48:19,714][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:48:19,714][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:48:27,753][__main__][INFO] - Number of regex retries in iteration 435: 0 [2026-03-25 20:48:27,755][__main__][INFO] - agents played in iteration 435 are Bob, Alice [2026-03-25 20:48:28,376][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:48:28,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:48:28,441][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:48:28,442][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:48:29,146][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:48:29,760][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:48:30,419][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:48:31,076][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:48:31,733][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:48:32,390][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:48:33,049][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:48:33,708][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:48:34,365][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:48:35,023][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:48:35,682][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:48:36,340][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:48:36,997][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:48:37,655][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:48:38,319][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:48:38,976][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:48:39,632][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:48:40,290][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:48:40,948][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:48:41,605][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:48:42,264][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:48:42,921][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:48:43,581][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:48:44,239][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:48:44,896][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:48:45,554][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:48:46,214][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:48:46,872][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:48:47,530][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:48:48,189][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:48:48,848][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:48:49,506][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:48:50,163][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:48:50,821][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:48:51,479][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:48:52,137][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:48:52,795][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:48:53,454][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:48:54,112][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:48:54,770][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:48:55,428][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:48:56,085][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:48:56,743][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:48:57,402][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:48:58,061][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:48:58,719][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:48:59,378][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:49:00,036][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:49:01,020][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:49:01,681][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:49:02,340][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:49:02,998][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:49:03,657][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:49:04,316][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:49:05,490][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:49:06,148][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:49:06,807][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:49:07,465][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:49:08,123][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:49:08,781][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:49:09,439][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:49:10,098][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:49:10,755][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:49:11,414][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:49:12,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:49:12,855][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:49:14,182][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:49:14,185][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:49:14,186][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:49:15,667][__main__][INFO] - Iteration 436 took 55s (14.37% Gen, 82.98% Train). Generation: 8s, Training: 46s. Estimated remaining time: 8h 49m 25s. Estimated total time: 15h 32m 35s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 15s, 500 more iterations: 7h 46m 17s. [2026-03-25 20:49:15,670][__main__][INFO] - Starting iteration 436. [2026-03-25 20:49:15,673][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:49:15,674][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:49:20,600][__main__][INFO] - Number of regex retries in iteration 436: 0 [2026-03-25 20:49:20,601][__main__][INFO] - agents played in iteration 436 are Bob, Alice [2026-03-25 20:49:21,121][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:49:21,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:49:21,183][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:49:21,184][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:49:22,065][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:49:22,684][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:49:23,344][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:49:24,002][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:49:25,219][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:49:25,877][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:49:26,536][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:49:27,194][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:49:27,853][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:49:28,513][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:49:29,170][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:49:29,830][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:49:30,489][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:49:31,146][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:49:31,805][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:49:32,464][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:49:33,122][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:49:33,781][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:49:34,439][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:49:35,098][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:49:35,756][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:49:36,413][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:49:37,072][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:49:37,729][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:49:38,387][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:49:39,045][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:49:39,704][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:49:40,362][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:49:41,019][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:49:41,678][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:49:42,335][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:49:42,993][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:49:43,651][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:49:44,310][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:49:44,970][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:49:45,629][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:49:46,287][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:49:46,944][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:49:47,602][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:49:48,259][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:49:48,917][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:49:49,575][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:49:50,233][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:49:50,891][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:49:51,549][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:49:52,207][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:49:52,865][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:49:53,525][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:49:54,513][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:49:55,172][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:49:55,830][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:49:56,489][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:49:57,148][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:49:57,806][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:49:58,464][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:49:59,122][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:49:59,780][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:50:00,438][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:50:01,096][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:50:01,754][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:50:02,413][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:50:03,072][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:50:03,732][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:50:04,391][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:50:05,050][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:50:05,813][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:50:07,258][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:50:07,261][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:50:07,262][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:50:08,764][__main__][INFO] - Iteration 437 took 53s (9.28% Gen, 87.89% Train). Generation: 4s, Training: 46s. Estimated remaining time: 8h 0m 49s. Estimated total time: 14h 44m 52s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 29s, 500 more iterations: 7h 22m 26s. [2026-03-25 20:50:08,766][__main__][INFO] - Starting iteration 437. [2026-03-25 20:50:08,770][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:50:08,771][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:50:14,374][__main__][INFO] - Number of regex retries in iteration 437: 0 [2026-03-25 20:50:14,375][__main__][INFO] - agents played in iteration 437 are Bob, Alice [2026-03-25 20:50:15,238][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:50:15,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:50:15,299][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:50:15,299][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:50:16,088][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:50:16,704][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:50:17,363][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:50:18,023][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:50:18,682][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:50:19,340][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:50:20,000][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:50:20,659][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:50:21,321][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:50:21,980][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:50:22,639][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:50:23,297][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:50:23,957][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:50:24,615][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:50:25,274][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:50:25,932][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:50:27,787][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:50:28,445][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:50:29,105][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:50:29,764][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:50:30,421][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:50:31,080][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:50:31,738][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:50:32,396][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:50:33,053][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:50:33,712][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:50:34,372][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:50:35,029][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:50:35,688][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:50:36,345][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:50:37,003][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:50:37,662][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:50:38,321][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:50:38,978][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:50:39,637][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:50:40,785][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:50:41,444][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:50:42,102][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:50:42,759][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:50:43,418][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:50:44,075][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:50:44,733][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:50:45,398][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:50:46,056][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:50:46,715][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:50:47,373][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:50:48,032][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:50:48,690][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:50:49,678][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:50:50,337][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:50:50,995][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:50:51,654][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:50:52,312][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:50:52,970][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:50:53,628][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:50:54,286][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:50:54,944][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:50:55,604][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:50:56,261][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:50:56,919][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:50:57,577][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:50:58,237][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:50:58,895][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:50:59,555][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:51:00,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:51:00,997][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 20:51:02,343][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:51:02,346][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:51:02,347][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:51:03,831][__main__][INFO] - Iteration 438 took 55s (10.18% Gen, 87.12% Train). Generation: 5s, Training: 47s. Estimated remaining time: 8h 32m 43s. Estimated total time: 15h 17m 42s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 46s, 500 more iterations: 7h 38m 51s. [2026-03-25 20:51:03,833][__main__][INFO] - Starting iteration 438. [2026-03-25 20:51:03,837][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:51:03,837][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:51:18,927][__main__][INFO] - Number of regex retries in iteration 438: 0 [2026-03-25 20:51:18,928][__main__][INFO] - agents played in iteration 438 are Bob, Alice [2026-03-25 20:51:19,420][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:51:19,483][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:51:19,483][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:51:19,484][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:51:20,191][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:51:20,811][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:51:21,471][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:51:22,129][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:51:22,787][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:51:23,444][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:51:24,103][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:51:24,762][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:51:25,422][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:51:26,079][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:51:26,738][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:51:27,396][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:51:28,054][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:51:28,715][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:51:29,373][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:51:30,034][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:51:30,692][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:51:31,350][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:51:32,008][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:51:32,667][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:51:33,327][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:51:33,986][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:51:34,645][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:51:35,305][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:51:35,964][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:51:36,622][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:51:37,281][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:51:37,939][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:51:38,598][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:51:39,256][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:51:39,915][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:51:40,573][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:51:41,232][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:51:41,890][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:51:42,550][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:51:43,209][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:51:43,868][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:51:44,526][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:51:45,185][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:51:45,844][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:51:46,502][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:51:47,161][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:51:47,819][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:51:48,478][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:51:49,136][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:51:49,794][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:51:50,453][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:51:51,111][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:51:52,100][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:51:52,762][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:51:53,422][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:51:54,079][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:51:54,742][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:51:55,399][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:51:56,058][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:51:56,717][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:51:57,377][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:51:58,036][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:51:58,695][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:51:59,353][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:52:00,011][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:52:00,669][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:52:01,328][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:52:01,987][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:52:02,644][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:52:03,429][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:52:04,875][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:52:04,878][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:52:04,879][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:52:06,614][__main__][INFO] - Iteration 439 took 1m 2s (24.04% Gen, 73.19% Train). Generation: 15s, Training: 45s. Estimated remaining time: 10h 40m 17s. Estimated total time: 17h 26m 18s. Time estimates for 10 more iterations: 10m 27s, 100 more iterations: 1h 44m 37s, 500 more iterations: 8h 43m 9s. [2026-03-25 20:52:06,616][__main__][INFO] - Starting iteration 439. [2026-03-25 20:52:06,620][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:52:06,621][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:52:12,450][__main__][INFO] - Number of regex retries in iteration 439: 0 [2026-03-25 20:52:12,451][__main__][INFO] - agents played in iteration 439 are Bob, Alice [2026-03-25 20:52:12,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:52:12,996][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:52:12,997][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:52:12,997][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:52:13,710][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:52:14,334][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:52:14,994][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:52:15,653][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:52:16,311][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:52:16,970][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:52:17,629][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:52:18,285][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:52:18,944][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:52:19,602][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:52:20,259][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:52:20,917][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:52:21,575][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:52:22,235][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:52:22,894][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:52:23,551][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:52:24,210][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:52:24,869][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:52:25,526][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:52:26,184][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:52:26,842][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:52:27,500][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:52:28,159][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:52:28,818][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:52:29,476][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:52:30,134][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:52:30,792][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:52:31,449][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:52:32,108][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:52:32,766][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:52:33,425][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:52:34,084][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:52:34,739][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:52:35,396][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:52:36,055][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:52:36,712][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:52:37,371][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:52:38,029][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:52:38,687][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:52:39,345][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:52:40,005][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:52:40,664][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:52:41,322][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:52:41,980][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:52:42,639][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:52:43,298][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:52:43,956][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:52:44,615][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:52:45,597][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:52:46,257][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:52:46,917][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:52:47,578][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:52:48,239][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:52:48,896][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:52:49,554][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:52:50,214][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:52:50,872][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:52:51,534][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:52:52,192][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:52:52,850][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:52:53,509][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:52:54,167][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:52:54,824][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:52:55,482][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:52:56,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:52:57,059][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:52:58,484][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:52:58,486][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:52:58,488][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:52:59,985][__main__][INFO] - Iteration 440 took 53s (10.93% Gen, 86.26% Train). Generation: 5s, Training: 46s. Estimated remaining time: 8h 2m 32s. Estimated total time: 14h 49m 26s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 56s, 500 more iterations: 7h 24m 43s. [2026-03-25 20:52:59,987][__main__][INFO] - Starting iteration 440. [2026-03-25 20:52:59,992][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:52:59,993][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:53:14,382][__main__][INFO] - Number of regex retries in iteration 440: 0 [2026-03-25 20:53:14,384][__main__][INFO] - agents played in iteration 440 are Bob, Alice [2026-03-25 20:53:15,516][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:53:15,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:53:15,579][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:53:15,580][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:53:16,267][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:53:16,895][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:53:17,554][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:53:18,211][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:53:18,868][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:53:19,526][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:53:20,183][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:53:20,842][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:53:21,499][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:53:22,157][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:53:22,816][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:53:23,473][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:53:24,132][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:53:24,789][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:53:25,450][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:53:26,108][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:53:26,766][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:53:27,424][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:53:28,082][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:53:28,740][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:53:29,398][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:53:30,056][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:53:30,715][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:53:31,372][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:53:32,030][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:53:32,688][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:53:33,346][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:53:34,003][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:53:34,661][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:53:35,319][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:53:35,977][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:53:36,634][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:53:37,292][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:53:37,949][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:53:38,607][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:53:39,266][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:53:39,924][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:53:40,587][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:53:41,245][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:53:41,903][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:53:42,560][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:53:43,219][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:53:43,877][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:53:44,535][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:53:45,193][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:53:45,850][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:53:46,508][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:53:47,166][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:53:48,149][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:53:48,808][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:53:49,467][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:53:50,126][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:53:50,786][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:53:51,443][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:53:52,102][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:53:52,762][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:53:53,420][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:53:54,078][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:53:54,739][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:53:55,398][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:53:56,057][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:53:56,717][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:53:57,376][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:53:58,035][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:53:58,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:53:59,490][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:54:01,026][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:54:01,028][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:54:01,029][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:54:02,453][__main__][INFO] - Iteration 441 took 1m 2s (23.04% Gen, 74.68% Train). Generation: 14s, Training: 46s. Estimated remaining time: 10h 33m 6s. Estimated total time: 17h 21m 3s. Time estimates for 10 more iterations: 10m 24s, 100 more iterations: 1h 44m 6s, 500 more iterations: 8h 40m 31s. [2026-03-25 20:54:02,455][__main__][INFO] - Starting iteration 441. [2026-03-25 20:54:02,458][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:54:02,459][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:54:07,852][__main__][INFO] - Number of regex retries in iteration 441: 0 [2026-03-25 20:54:07,853][__main__][INFO] - agents played in iteration 441 are Bob, Alice [2026-03-25 20:54:08,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:54:08,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:54:08,692][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:54:08,693][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:54:09,401][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:54:10,017][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:54:10,679][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:54:11,338][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:54:11,996][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:54:12,655][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:54:13,319][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:54:13,978][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:54:14,640][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:54:15,299][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:54:15,957][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:54:16,616][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:54:17,275][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:54:17,933][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:54:18,591][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:54:19,249][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:54:19,908][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:54:20,567][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:54:21,225][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:54:21,885][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:54:22,543][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:54:23,201][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:54:23,859][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:54:24,518][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:54:25,176][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:54:25,834][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:54:26,493][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:54:27,151][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:54:27,811][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:54:28,468][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:54:29,127][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:54:29,786][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:54:30,445][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:54:31,103][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:54:31,762][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:54:32,426][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:54:33,085][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:54:33,744][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:54:34,403][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:54:35,061][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:54:35,719][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:54:36,376][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:54:37,036][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:54:37,694][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:54:38,352][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:54:39,009][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:54:39,669][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:54:40,327][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:54:41,320][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:54:41,978][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:54:42,637][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:54:43,295][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:54:43,955][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:54:44,614][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:54:45,273][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:54:45,931][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:54:46,591][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:54:47,249][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:54:47,907][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:54:48,565][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:54:49,224][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:54:49,883][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:54:50,543][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:54:51,201][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:54:51,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:54:52,652][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:54:53,984][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:54:53,986][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:54:53,987][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:54:55,524][__main__][INFO] - Iteration 442 took 53s (10.16% Gen, 86.94% Train). Generation: 5s, Training: 46s. Estimated remaining time: 7h 55m 37s. Estimated total time: 14h 44m 27s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 26s, 500 more iterations: 7h 22m 13s. [2026-03-25 20:54:55,526][__main__][INFO] - Starting iteration 442. [2026-03-25 20:54:55,532][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:54:55,532][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:55:00,744][__main__][INFO] - Number of regex retries in iteration 442: 0 [2026-03-25 20:55:00,745][__main__][INFO] - agents played in iteration 442 are Bob, Alice [2026-03-25 20:55:01,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:55:01,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:55:01,349][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:55:01,350][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:55:02,163][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:55:02,786][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:55:03,448][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:55:04,107][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:55:04,767][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:55:05,428][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:55:06,089][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:55:06,748][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:55:07,409][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:55:08,069][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:55:08,730][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:55:09,391][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:55:10,051][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:55:10,710][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:55:11,369][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:55:12,028][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:55:12,688][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:55:13,348][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:55:14,007][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:55:14,667][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:55:15,326][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:55:15,985][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:55:16,644][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:55:17,303][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:55:17,964][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:55:18,625][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:55:19,284][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:55:19,943][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:55:20,602][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:55:21,263][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:55:21,922][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:55:22,582][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:55:23,244][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:55:23,904][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:55:24,562][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:55:25,223][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:55:25,882][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:55:26,546][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:55:27,205][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:55:27,866][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:55:28,525][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:55:29,184][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:55:29,844][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:55:30,503][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:55:31,163][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:55:31,822][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:55:32,481][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:55:33,140][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:55:34,130][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:55:34,789][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:55:35,448][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:55:36,106][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:55:36,764][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:55:37,424][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:55:38,082][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:55:38,741][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:55:39,399][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:55:40,060][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:55:40,718][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:55:41,376][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:55:42,034][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:55:42,694][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:55:43,353][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:55:44,011][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:55:44,669][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:55:45,668][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:55:47,013][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:55:47,016][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:55:47,018][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:55:48,580][__main__][INFO] - Iteration 443 took 53s (9.83% Gen, 87.23% Train). Generation: 5s, Training: 46s. Estimated remaining time: 7h 54m 26s. Estimated total time: 14h 44m 9s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 24s, 500 more iterations: 7h 22m 4s. [2026-03-25 20:55:48,582][__main__][INFO] - Starting iteration 443. [2026-03-25 20:55:48,586][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:55:48,587][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:55:53,906][__main__][INFO] - Number of regex retries in iteration 443: 0 [2026-03-25 20:55:53,907][__main__][INFO] - agents played in iteration 443 are Bob, Alice [2026-03-25 20:55:54,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:55:54,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:55:54,608][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:55:54,608][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:55:55,406][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:55:56,012][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:55:56,672][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:55:57,332][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:55:57,992][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:55:58,652][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:55:59,313][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:55:59,972][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:56:00,633][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:56:01,293][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:56:02,374][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:56:03,033][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:56:03,691][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:56:04,351][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:56:05,010][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:56:05,668][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:56:06,327][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:56:06,984][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:56:07,643][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:56:08,304][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:56:08,963][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:56:09,621][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:56:10,280][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:56:10,939][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:56:11,599][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:56:12,259][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:56:12,918][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:56:13,578][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:56:14,238][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:56:14,897][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:56:15,557][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:56:16,215][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:56:16,873][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:56:17,533][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:56:18,191][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:56:18,850][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:56:19,511][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:56:20,170][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:56:20,828][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:56:23,461][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:56:24,120][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:56:24,778][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:56:25,443][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:56:26,101][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:56:26,760][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:56:27,419][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:56:28,078][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:56:28,737][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:56:29,732][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:56:30,391][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:56:31,049][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:56:31,707][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:56:32,366][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:56:33,025][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:56:33,683][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:56:34,341][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:56:35,001][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:56:35,660][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:56:36,320][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:56:36,980][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:56:37,638][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:56:38,296][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:56:38,955][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:56:39,614][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:56:40,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:56:41,044][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:45 [2026-03-25 20:56:42,115][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:56:42,118][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:56:42,119][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:56:43,337][__main__][INFO] - Iteration 444 took 54s (9.72% Gen, 88.05% Train). Generation: 5s, Training: 48s. Estimated remaining time: 8h 21m 54s. Estimated total time: 15h 12m 32s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 15s, 500 more iterations: 7h 36m 16s. [2026-03-25 20:56:43,339][__main__][INFO] - Starting iteration 444. [2026-03-25 20:56:43,344][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:56:43,344][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:56:50,592][__main__][INFO] - Number of regex retries in iteration 444: 0 [2026-03-25 20:56:50,593][__main__][INFO] - agents played in iteration 444 are Bob, Alice [2026-03-25 20:56:51,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:56:51,916][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:56:51,916][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:56:51,917][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:56:52,752][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:56:53,363][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:56:54,022][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:56:54,680][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:56:55,341][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:56:56,004][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:56:56,664][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:56:57,321][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:56:57,980][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:56:58,638][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:56:59,298][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:56:59,955][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:57:00,613][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:57:01,273][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:57:01,932][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:57:02,590][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:57:03,248][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:57:03,906][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:57:04,564][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:57:05,224][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:57:05,882][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:57:06,540][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:57:07,198][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:57:07,856][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:57:08,514][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:57:09,174][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:57:09,833][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:57:10,491][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:57:11,148][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:57:11,807][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:57:12,466][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:57:13,124][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:57:13,789][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:57:14,446][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:57:15,106][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:57:15,766][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:57:16,426][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:57:17,085][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:57:19,840][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:57:20,498][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:57:21,158][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:57:21,818][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:57:22,478][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:57:23,138][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:57:23,800][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:57:24,460][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:57:25,121][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:57:25,780][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:57:26,762][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:57:27,422][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:57:28,082][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:57:28,740][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:57:29,399][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:57:30,058][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:57:30,717][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:57:31,375][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:57:32,033][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:57:32,693][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:57:33,353][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:57:34,010][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:57:34,668][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:57:35,326][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:57:35,984][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:57:36,644][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:57:37,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:57:38,580][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:45 [2026-03-25 20:57:40,015][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:57:40,017][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:57:40,018][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:57:41,471][__main__][INFO] - Iteration 445 took 58s (12.47% Gen, 85.03% Train). Generation: 7s, Training: 49s. Estimated remaining time: 9h 17m 13s. Estimated total time: 16h 8m 49s. Time estimates for 10 more iterations: 9m 41s, 100 more iterations: 1h 36m 52s, 500 more iterations: 8h 4m 24s. [2026-03-25 20:57:41,473][__main__][INFO] - Starting iteration 445. [2026-03-25 20:57:41,478][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:57:41,478][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:57:47,964][__main__][INFO] - Number of regex retries in iteration 445: 0 [2026-03-25 20:57:47,966][__main__][INFO] - agents played in iteration 445 are Bob, Alice [2026-03-25 20:57:48,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:57:48,616][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:57:48,617][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:57:48,617][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:57:49,370][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:57:49,987][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:57:50,645][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:57:51,305][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:57:51,963][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:57:52,622][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:57:53,281][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:57:53,940][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:57:54,599][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:57:55,258][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:57:55,916][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:57:56,575][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:57:57,234][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:57:57,891][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:57:58,550][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:57:59,209][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:57:59,867][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:58:00,525][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:58:01,184][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:58:01,842][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:58:02,500][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:58:03,158][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:58:03,817][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:58:04,478][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:58:05,137][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:58:05,795][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:58:06,453][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:58:07,111][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:58:07,769][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:58:08,428][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:58:09,086][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:58:09,745][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:58:10,403][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:58:11,062][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:58:11,721][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:58:12,380][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:58:13,039][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:58:13,698][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:58:14,358][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:58:15,018][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:58:15,677][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:58:16,337][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:58:16,996][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:58:17,656][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:58:18,315][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:58:18,975][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:58:19,634][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:58:20,293][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:58:21,274][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:58:21,934][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:58:22,595][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:58:23,254][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:58:23,914][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:58:24,575][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:58:25,235][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:58:25,896][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:58:26,557][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:58:27,216][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:58:27,876][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:58:28,540][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:58:29,202][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:58:29,861][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:58:30,522][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:58:31,181][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:58:31,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:58:32,717][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 20:58:34,058][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:58:34,061][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:58:34,062][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:58:35,461][__main__][INFO] - Iteration 446 took 53s (12.02% Gen, 85.39% Train). Generation: 6s, Training: 46s. Estimated remaining time: 8h 7m 14s. Estimated total time: 14h 59m 45s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 58s, 500 more iterations: 7h 29m 52s. [2026-03-25 20:58:35,463][__main__][INFO] - Starting iteration 446. [2026-03-25 20:58:35,466][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:58:35,467][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:58:43,330][__main__][INFO] - Number of regex retries in iteration 446: 0 [2026-03-25 20:58:43,331][__main__][INFO] - agents played in iteration 446 are Bob, Alice [2026-03-25 20:58:43,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:58:44,026][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:58:44,027][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:58:44,027][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:58:44,883][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:58:45,500][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:58:46,161][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:58:46,821][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:58:47,485][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:58:48,146][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:58:48,809][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:58:49,469][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:58:50,131][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:58:50,792][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:58:51,453][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:58:52,114][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:58:52,773][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:58:53,434][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:58:54,098][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:58:54,757][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:58:55,418][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:58:56,078][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:58:58,983][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:59:00,416][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:59:01,075][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:59:01,733][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:59:02,392][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:59:03,050][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:59:03,708][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:59:04,366][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:59:05,025][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:59:05,684][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:59:06,343][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:59:07,001][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:59:07,659][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:59:08,318][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:59:08,978][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:59:09,637][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:59:10,296][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:59:10,955][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:59:11,614][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:59:12,274][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:59:12,933][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:59:13,591][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:59:14,252][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:59:14,910][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:59:15,569][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:59:16,227][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:59:16,886][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:59:17,544][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:59:18,203][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:59:18,862][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:59:19,849][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:59:20,509][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:59:21,171][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:59:21,832][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:59:22,490][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:59:23,148][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:59:23,806][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:59:24,464][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:59:25,124][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:59:25,780][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:59:26,437][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:59:27,095][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:59:27,754][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:59:28,413][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:59:29,071][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:59:29,729][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:59:30,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:59:31,267][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:46 [2026-03-25 20:59:32,948][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:59:32,951][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:59:32,952][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:59:34,420][__main__][INFO] - Iteration 447 took 58s (13.34% Gen, 84.17% Train). Generation: 7s, Training: 49s. Estimated remaining time: 9h 29m 7s. Estimated total time: 16h 22m 36s. Time estimates for 10 more iterations: 9m 49s, 100 more iterations: 1h 38m 15s, 500 more iterations: 8h 11m 18s. [2026-03-25 20:59:34,423][__main__][INFO] - Starting iteration 447. [2026-03-25 20:59:34,426][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:59:34,427][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:59:39,580][__main__][INFO] - Number of regex retries in iteration 447: 0 [2026-03-25 20:59:39,581][__main__][INFO] - agents played in iteration 447 are Bob, Alice [2026-03-25 20:59:40,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:59:40,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 20:59:40,140][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:59:40,141][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:59:40,907][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:59:41,524][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:59:42,184][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:59:42,844][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:59:43,503][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:59:44,162][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:59:44,821][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:59:45,481][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:59:46,140][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:59:46,798][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:59:47,457][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:59:48,117][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:59:48,778][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:59:49,436][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:59:50,095][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:59:50,754][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:59:51,412][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:59:52,071][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:59:52,729][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:59:53,386][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:59:54,048][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:59:54,707][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:59:55,367][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:59:56,026][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:59:56,684][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:59:57,342][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:59:58,000][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:59:58,659][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:59:59,317][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:59:59,977][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:00:00,635][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:00:01,294][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:00:01,953][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:00:02,611][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:00:03,270][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:00:03,930][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:00:04,588][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:00:05,247][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:00:05,905][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:00:06,565][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:00:07,224][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:00:07,883][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:00:08,543][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:00:09,202][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:00:10,411][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:00:11,070][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:00:11,729][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:00:12,388][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:00:13,378][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:00:14,038][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:00:14,697][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:00:15,356][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:00:16,016][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:00:16,674][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:00:17,333][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:00:17,992][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:00:18,650][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:00:19,311][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:00:19,974][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:00:20,631][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:00:21,290][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:00:21,949][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:00:22,610][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:00:23,268][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:00:23,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:00:24,730][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:00:26,010][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:00:26,013][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:00:26,014][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:00:27,515][__main__][INFO] - Iteration 448 took 53s (9.71% Gen, 87.46% Train). Generation: 5s, Training: 46s. Estimated remaining time: 7h 50m 28s. Estimated total time: 14h 44m 50s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 29s, 500 more iterations: 7h 22m 25s. [2026-03-25 21:00:27,518][__main__][INFO] - Starting iteration 448. [2026-03-25 21:00:27,523][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:00:27,524][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:00:33,833][__main__][INFO] - Number of regex retries in iteration 448: 0 [2026-03-25 21:00:33,834][__main__][INFO] - agents played in iteration 448 are Bob, Alice [2026-03-25 21:00:34,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:00:35,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:00:35,040][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:00:35,040][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:00:35,741][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:00:36,350][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:00:37,006][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:00:37,664][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:00:38,323][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:00:38,981][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:00:39,640][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:00:40,300][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:00:40,959][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:00:41,618][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:00:42,276][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:00:42,935][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:00:43,596][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:00:44,255][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:00:44,916][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:00:45,575][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:00:46,235][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:00:46,895][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:00:47,552][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:00:48,217][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:00:48,875][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:00:49,535][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:00:50,193][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:00:50,852][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:00:51,509][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:00:52,168][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:00:52,827][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:00:53,484][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:00:54,142][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:00:54,800][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:00:55,458][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:00:56,116][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:00:56,776][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:00:57,434][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:00:58,092][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:00:58,751][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:00:59,410][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:01:00,069][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:01:00,727][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:01:01,386][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:01:02,044][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:01:02,702][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:01:03,361][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:01:04,021][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:01:04,679][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:01:05,337][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:01:05,995][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:01:06,655][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:01:07,640][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:01:08,299][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:01:08,958][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:01:09,618][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:01:10,277][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:01:10,937][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:01:11,598][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:01:12,257][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:01:12,916][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:01:13,577][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:01:14,236][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:01:14,895][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:01:15,556][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:01:16,217][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:01:16,877][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:01:17,535][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:01:18,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:01:19,002][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:01:20,402][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:01:20,405][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:01:20,406][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:01:22,135][__main__][INFO] - Iteration 449 took 54s (11.55% Gen, 85.28% Train). Generation: 6s, Training: 46s. Estimated remaining time: 8h 14m 56s. Estimated total time: 15h 10m 12s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 1s, 500 more iterations: 7h 35m 6s. [2026-03-25 21:01:22,137][__main__][INFO] - Starting iteration 449. [2026-03-25 21:01:22,140][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:01:22,141][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:01:27,269][__main__][INFO] - Number of regex retries in iteration 449: 0 [2026-03-25 21:01:27,270][__main__][INFO] - agents played in iteration 449 are Bob, Alice [2026-03-25 21:01:27,758][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:01:27,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:01:27,820][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:01:27,821][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:01:28,476][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:01:29,100][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:01:29,759][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:01:30,416][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:01:31,076][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:01:31,734][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:01:32,393][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:01:33,050][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:01:33,713][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:01:34,371][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:01:35,030][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:01:35,688][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:01:36,346][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:01:37,005][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:01:37,665][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:01:38,324][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:01:38,983][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:01:39,644][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:01:40,303][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:01:40,962][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:01:41,621][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:01:42,281][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:01:42,939][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:01:43,599][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:01:44,257][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:01:44,916][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:01:46,647][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:01:47,305][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:01:47,964][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:01:48,625][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:01:49,285][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:01:49,945][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:01:50,604][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:01:51,264][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:01:51,929][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:01:52,589][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:01:53,247][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:01:53,905][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:01:54,564][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:01:55,222][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:01:55,884][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:01:56,541][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:01:57,201][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:01:57,860][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:01:58,519][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:01:59,178][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:01:59,838][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:02:00,496][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:02:01,492][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:02:02,153][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:02:02,811][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:02:03,471][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:02:04,130][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:02:05,438][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:02:06,097][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:02:06,756][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:02:07,416][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:02:08,076][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:02:08,734][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:02:09,392][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:02:10,051][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:02:10,711][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:02:11,369][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:02:12,027][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:02:12,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:02:13,461][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 21:02:14,943][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:02:14,945][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:02:14,947][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:02:16,511][__main__][INFO] - Iteration 450 took 54s (9.43% Gen, 87.69% Train). Generation: 5s, Training: 47s. Estimated remaining time: 8h 10m 1s. Estimated total time: 15h 6m 12s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 37s, 500 more iterations: 7h 33m 6s. [2026-03-25 21:02:16,513][__main__][INFO] - Starting iteration 450. [2026-03-25 21:02:16,517][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:02:16,517][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:02:21,286][__main__][INFO] - Number of regex retries in iteration 450: 0 [2026-03-25 21:02:21,287][__main__][INFO] - agents played in iteration 450 are Bob, Alice [2026-03-25 21:02:21,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:02:21,953][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:02:21,953][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:02:21,954][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:02:22,737][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:02:23,354][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:02:24,014][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:02:24,671][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:02:25,330][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:02:25,988][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:02:26,648][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:02:27,306][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:02:27,963][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:02:28,622][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:02:29,281][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:02:29,941][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:02:30,598][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:02:31,257][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:02:31,916][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:02:32,575][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:02:33,233][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:02:33,892][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:02:34,551][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:02:35,210][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:02:35,869][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:02:36,528][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:02:37,186][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:02:37,844][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:02:38,502][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:02:39,161][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:02:39,821][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:02:40,479][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:02:41,137][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:02:41,798][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:02:42,456][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:02:43,115][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:02:43,773][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:02:44,431][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:02:45,089][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:02:45,747][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:02:46,405][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:02:47,063][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:02:47,722][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:02:48,380][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:02:49,038][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:02:49,697][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:02:50,355][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:02:51,013][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:02:51,671][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:02:52,331][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:02:52,989][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:02:53,647][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:02:54,631][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:02:55,293][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:02:55,951][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:02:56,609][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:02:57,272][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:02:57,929][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:02:58,592][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:02:59,249][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:03:01,005][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:03:01,660][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:03:02,318][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:03:02,975][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:03:03,634][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:03:04,294][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:03:04,950][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:03:05,609][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:03:06,268][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:03:06,991][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 21:03:08,323][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:03:08,326][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:03:08,328][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:03:11,320][__main__][INFO] - Iteration 451 took 54s (8.70% Gen, 85.83% Train). Generation: 4s, Training: 47s. Estimated remaining time: 8h 16m 19s. Estimated total time: 15h 13m 25s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 20s, 500 more iterations: 7h 36m 42s. [2026-03-25 21:03:11,323][__main__][INFO] - Starting iteration 451. [2026-03-25 21:03:11,326][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:03:11,327][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:03:27,208][__main__][INFO] - Number of regex retries in iteration 451: 0 [2026-03-25 21:03:27,209][__main__][INFO] - agents played in iteration 451 are Bob, Alice [2026-03-25 21:03:28,363][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:03:28,425][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:03:28,426][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:03:28,427][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:03:29,085][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:03:29,696][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:03:30,354][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:03:31,014][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:03:31,672][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:03:32,331][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:03:32,989][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:03:33,648][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:03:34,306][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:03:34,963][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:03:35,625][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:03:36,285][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:03:36,942][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:03:37,603][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:03:38,261][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:03:38,919][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:03:39,577][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:03:40,236][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:03:40,894][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:03:41,552][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:03:42,211][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:03:42,868][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:03:43,525][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:03:44,183][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:03:44,841][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:03:45,499][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:03:46,157][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:03:46,814][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:03:47,473][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:03:48,132][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:03:48,791][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:03:49,449][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:03:50,107][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:03:50,765][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:03:51,425][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:03:52,084][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:03:52,746][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:03:53,405][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:03:54,062][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:03:54,720][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:03:55,379][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:03:56,037][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:03:56,694][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:03:57,352][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:03:58,009][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:03:58,667][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:03:59,325][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:03:59,984][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:04:00,966][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:04:01,625][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:04:02,283][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:04:02,942][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:04:03,601][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:04:04,259][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:04:04,917][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:04:05,575][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:04:06,233][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:04:06,892][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:04:07,550][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:04:08,208][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:04:08,866][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:04:09,525][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:04:10,184][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:04:10,842][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:04:11,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:04:12,259][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:04:13,807][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:04:13,810][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:04:13,811][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:04:15,349][__main__][INFO] - Iteration 452 took 1m 4s (24.81% Gen, 72.79% Train). Generation: 15s, Training: 46s. Estimated remaining time: 10h 48m 54s. Estimated total time: 17h 47m 4s. Time estimates for 10 more iterations: 10m 40s, 100 more iterations: 1h 46m 42s, 500 more iterations: 8h 53m 32s. [2026-03-25 21:04:15,352][__main__][INFO] - Starting iteration 452. [2026-03-25 21:04:15,355][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:04:15,355][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:04:20,520][__main__][INFO] - Number of regex retries in iteration 452: 0 [2026-03-25 21:04:20,520][__main__][INFO] - agents played in iteration 452 are Bob, Alice [2026-03-25 21:04:21,058][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:04:21,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:04:21,120][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:04:21,120][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:04:21,935][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:04:22,554][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:04:23,214][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:04:23,874][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:04:24,533][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:04:25,193][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:04:25,853][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:04:26,513][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:04:27,174][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:04:27,833][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:04:28,494][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:04:29,153][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:04:29,812][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:04:30,471][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:04:31,132][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:04:31,791][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:04:32,451][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:04:33,113][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:04:33,772][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:04:34,430][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:04:35,090][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:04:35,748][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:04:36,407][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:04:37,066][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:04:37,726][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:04:38,385][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:04:39,044][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:04:39,703][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:04:40,363][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:04:41,022][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:04:41,681][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:04:42,341][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:04:43,001][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:04:43,660][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:04:44,319][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:04:44,978][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:04:45,641][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:04:46,298][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:04:46,957][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:04:47,617][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:04:48,277][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:04:48,937][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:04:49,597][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:04:50,257][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:04:50,916][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:04:51,577][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:04:52,239][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:04:52,900][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:04:53,891][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:04:54,554][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:04:55,213][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:04:55,874][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:04:56,536][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:04:57,195][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:04:57,856][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:04:58,514][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:04:59,173][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:04:59,832][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:05:00,491][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:05:01,152][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:05:01,810][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:05:02,468][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:05:03,126][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:05:03,784][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:05:04,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:05:05,381][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:05:06,705][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:05:06,708][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:05:06,709][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:05:08,262][__main__][INFO] - Iteration 453 took 52s (9.76% Gen, 87.30% Train). Generation: 5s, Training: 46s. Estimated remaining time: 7h 42m 45s. Estimated total time: 14h 41m 48s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 10s, 500 more iterations: 7h 20m 54s. [2026-03-25 21:05:08,264][__main__][INFO] - Starting iteration 453. [2026-03-25 21:05:08,269][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:05:08,269][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:05:13,036][__main__][INFO] - Number of regex retries in iteration 453: 0 [2026-03-25 21:05:13,038][__main__][INFO] - agents played in iteration 453 are Bob, Alice [2026-03-25 21:05:13,652][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:05:14,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:05:14,688][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:05:14,688][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:05:15,479][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:05:16,107][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:05:16,765][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:05:17,422][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:05:18,081][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:05:18,739][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:05:19,396][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:05:20,053][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:05:20,711][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:05:21,368][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:05:22,025][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:05:22,683][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:05:23,340][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:05:24,000][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:05:24,659][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:05:25,320][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:05:25,979][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:05:26,639][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:05:27,299][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:05:27,959][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:05:28,618][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:05:29,277][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:05:29,936][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:05:30,595][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:05:31,253][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:05:31,912][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:05:32,570][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:05:33,229][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:05:33,888][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:05:34,546][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:05:35,204][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:05:35,864][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:05:36,524][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:05:37,182][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:05:37,839][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:05:38,498][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:05:39,156][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:05:39,815][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:05:40,474][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:05:41,133][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:05:41,791][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:05:42,449][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:05:43,108][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:05:43,767][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:05:44,425][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:05:45,084][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:05:45,742][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:05:46,401][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:05:47,388][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:05:48,047][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:05:48,706][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:05:49,364][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:05:50,023][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:05:50,681][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:05:51,339][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:05:51,997][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:05:52,655][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:05:53,313][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:05:53,972][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:05:54,630][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:05:55,294][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:05:55,951][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:05:56,609][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:05:57,268][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:05:57,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:05:58,714][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:06:00,132][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:06:00,136][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:06:00,137][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:06:01,654][__main__][INFO] - Iteration 454 took 53s (8.93% Gen, 88.22% Train). Generation: 4s, Training: 47s. Estimated remaining time: 7h 49m 51s. Estimated total time: 14h 49m 47s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 58s, 500 more iterations: 7h 24m 53s. [2026-03-25 21:06:01,657][__main__][INFO] - Starting iteration 454. [2026-03-25 21:06:01,661][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:06:01,661][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:06:06,976][__main__][INFO] - Number of regex retries in iteration 454: 0 [2026-03-25 21:06:06,977][__main__][INFO] - agents played in iteration 454 are Bob, Alice [2026-03-25 21:06:07,550][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:06:07,612][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:06:07,613][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:06:07,613][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:06:08,271][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:06:08,885][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:06:09,544][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:06:10,204][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:06:10,863][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:06:11,522][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:06:12,181][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:06:12,840][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:06:13,500][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:06:14,159][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:06:14,817][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:06:15,475][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:06:16,133][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:06:16,792][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:06:17,452][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:06:18,110][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:06:18,767][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:06:19,426][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:06:20,084][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:06:20,744][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:06:21,402][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:06:22,063][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:06:22,722][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:06:23,381][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:06:24,042][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:06:24,705][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:06:25,360][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:06:26,018][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:06:26,676][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:06:27,335][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:06:27,992][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:06:28,651][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:06:29,309][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:06:29,969][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:06:30,627][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:06:31,285][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:06:31,943][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:06:32,601][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:06:33,259][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:06:33,918][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:06:34,575][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:06:35,233][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:06:35,891][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:06:36,551][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:06:37,209][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:06:37,867][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:06:38,525][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:06:39,183][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:06:40,170][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:06:40,829][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:06:41,488][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:06:42,145][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:06:42,803][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:06:43,461][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:06:44,120][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:06:44,778][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:06:45,437][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:06:46,094][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:06:46,752][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:06:47,409][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:06:48,066][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:06:48,725][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:06:49,383][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:06:50,043][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:06:50,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:06:51,493][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:06:52,912][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:06:52,915][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:06:52,916][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:06:54,320][__main__][INFO] - Iteration 455 took 52s (10.09% Gen, 87.24% Train). Generation: 5s, Training: 45s. Estimated remaining time: 7h 36m 51s. Estimated total time: 14h 37m 40s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 46s, 500 more iterations: 7h 18m 50s. [2026-03-25 21:06:54,322][__main__][INFO] - Starting iteration 455. [2026-03-25 21:06:54,326][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:06:54,326][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:06:59,551][__main__][INFO] - Number of regex retries in iteration 455: 0 [2026-03-25 21:06:59,552][__main__][INFO] - agents played in iteration 455 are Bob, Alice [2026-03-25 21:07:00,597][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:07:00,658][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:07:00,659][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:07:00,659][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:07:01,431][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:07:02,044][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:07:02,704][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:07:03,364][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:07:04,022][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:07:04,680][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:07:05,339][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:07:05,997][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:07:06,657][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:07:07,316][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:07:07,979][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:07:08,634][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:07:09,292][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:07:09,951][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:07:10,611][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:07:11,269][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:07:11,927][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:07:12,585][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:07:13,244][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:07:13,902][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:07:14,561][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:07:15,219][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:07:15,877][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:07:16,535][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:07:17,192][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:07:17,851][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:07:18,510][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:07:19,169][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:07:19,826][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:07:20,485][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:07:21,144][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:07:21,804][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:07:22,462][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:07:23,121][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:07:23,779][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:07:24,440][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:07:25,099][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:07:25,759][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:07:26,416][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:07:27,074][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:07:27,732][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:07:28,390][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:07:29,049][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:07:29,707][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:07:30,365][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:07:31,023][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:07:31,682][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:07:32,341][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:07:33,335][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:07:33,994][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:07:34,654][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:07:35,313][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:07:35,972][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:07:36,631][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:07:37,291][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:07:37,953][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:07:38,612][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:07:39,274][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:07:39,934][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:07:40,593][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:07:41,253][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:07:41,912][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:07:42,571][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:07:43,231][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:07:43,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:07:44,667][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:07:46,264][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:07:46,266][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:07:46,267][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:07:47,685][__main__][INFO] - Iteration 456 took 53s (9.79% Gen, 87.55% Train). Generation: 5s, Training: 46s. Estimated remaining time: 7h 47m 38s. Estimated total time: 14h 49m 21s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 56s, 500 more iterations: 7h 24m 40s. [2026-03-25 21:07:47,687][__main__][INFO] - Starting iteration 456. [2026-03-25 21:07:47,692][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:07:47,692][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:07:53,360][__main__][INFO] - Number of regex retries in iteration 456: 0 [2026-03-25 21:07:53,361][__main__][INFO] - agents played in iteration 456 are Bob, Alice [2026-03-25 21:07:53,871][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:07:53,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:07:53,933][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:07:53,934][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:07:54,652][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:07:55,280][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:07:55,940][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:07:56,598][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:07:57,257][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:07:57,915][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:07:58,573][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:07:59,231][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:08:03,928][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:08:04,586][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:08:05,245][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:08:05,903][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:08:06,561][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:08:07,218][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:08:07,875][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:08:08,532][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:08:09,189][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:08:09,847][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:08:10,505][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:08:11,163][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:08:11,820][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:08:12,478][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:08:13,136][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:08:13,795][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:08:14,450][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:08:15,107][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:08:15,765][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:08:16,423][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:08:17,080][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:08:17,739][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:08:18,398][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:08:19,056][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:08:19,715][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:08:20,372][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:08:21,032][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:08:21,690][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:08:22,348][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:08:23,006][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:08:23,664][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:08:24,323][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:08:24,982][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:08:25,640][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:08:26,299][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:08:26,957][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:08:27,616][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:08:28,275][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:08:28,934][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:08:29,592][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:08:30,580][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:08:31,239][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:08:31,898][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:08:32,557][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:08:33,218][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:08:33,876][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:08:34,534][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:08:35,191][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:08:35,880][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:08:36,538][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:08:37,197][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:08:37,854][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:08:38,513][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:08:39,170][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:08:39,829][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:08:40,489][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:08:41,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:08:41,937][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:47 [2026-03-25 21:08:43,589][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:08:43,592][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:08:43,593][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:08:45,078][__main__][INFO] - Iteration 457 took 57s (9.88% Gen, 87.53% Train). Generation: 5s, Training: 50s. Estimated remaining time: 8h 53m 48s. Estimated total time: 15h 56m 28s. Time estimates for 10 more iterations: 9m 33s, 100 more iterations: 1h 35m 38s, 500 more iterations: 7h 58m 14s. [2026-03-25 21:08:45,081][__main__][INFO] - Starting iteration 457. [2026-03-25 21:08:45,084][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:08:45,085][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:08:50,004][__main__][INFO] - Number of regex retries in iteration 457: 0 [2026-03-25 21:08:50,005][__main__][INFO] - agents played in iteration 457 are Bob, Alice [2026-03-25 21:08:50,603][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:08:50,666][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:08:50,666][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:08:50,667][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:08:51,383][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:08:52,004][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:08:52,662][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:08:53,321][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:08:53,980][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:08:54,640][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:08:55,298][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:08:55,957][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:08:56,615][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:08:57,274][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:08:57,933][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:08:58,592][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:08:59,258][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:08:59,918][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:09:00,576][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:09:01,235][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:09:01,894][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:09:02,552][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:09:03,210][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:09:03,869][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:09:04,528][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:09:05,185][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:09:05,843][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:09:06,502][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:09:07,159][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:09:07,818][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:09:08,475][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:09:09,133][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:09:09,791][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:09:10,450][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:09:11,108][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:09:11,766][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:09:12,425][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:09:13,082][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:09:13,740][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:09:14,398][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:09:15,057][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:09:15,715][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:09:16,373][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:09:17,031][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:09:17,689][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:09:18,346][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:09:19,004][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:09:19,662][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:09:20,321][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:09:20,980][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:09:21,638][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:09:22,296][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:09:23,288][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:09:23,946][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:09:24,604][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:09:25,262][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:09:25,919][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:09:26,577][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:09:27,235][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:09:27,892][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:09:28,550][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:09:29,209][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:09:29,867][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:09:30,524][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:09:31,183][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:09:31,841][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:09:32,499][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:09:33,157][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:09:33,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:09:34,622][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:09:36,043][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:09:36,045][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:09:36,046][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:09:37,561][__main__][INFO] - Iteration 458 took 52s (9.38% Gen, 87.73% Train). Generation: 4s, Training: 46s. Estimated remaining time: 7h 31m 6s. Estimated total time: 14h 34m 38s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 27s, 500 more iterations: 7h 17m 19s. [2026-03-25 21:09:37,563][__main__][INFO] - Starting iteration 458. [2026-03-25 21:09:37,567][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:09:37,567][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:09:42,468][__main__][INFO] - Number of regex retries in iteration 458: 0 [2026-03-25 21:09:42,469][__main__][INFO] - agents played in iteration 458 are Bob, Alice [2026-03-25 21:09:42,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:09:43,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:09:43,024][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:09:43,024][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:09:43,730][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:09:44,363][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:09:45,019][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:09:45,677][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:09:46,339][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:09:46,998][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:09:47,657][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:09:48,314][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:09:48,974][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:09:49,632][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:09:50,290][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:09:50,949][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:09:51,607][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:09:52,264][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:09:52,923][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:09:53,582][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:09:54,240][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:09:54,897][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:09:55,555][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:09:56,213][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:09:56,871][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:09:57,528][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:09:58,186][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:09:58,844][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:09:59,502][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:10:00,160][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:10:00,818][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:10:01,476][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:10:02,135][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:10:02,793][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:10:03,450][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:10:04,107][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:10:04,765][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:10:05,424][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:10:06,082][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:10:06,740][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:10:07,398][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:10:08,055][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:10:08,713][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:10:09,371][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:10:10,028][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:10:10,686][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:10:11,343][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:10:12,000][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:10:12,658][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:10:13,317][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:10:13,974][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:10:14,631][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:10:15,611][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:10:16,272][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:10:16,930][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:10:17,589][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:10:18,248][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:10:18,907][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:10:19,565][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:10:20,224][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:10:20,883][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:10:21,542][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:10:22,199][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:10:22,857][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:10:23,514][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:10:24,172][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:10:24,830][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:10:25,487][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:10:26,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:10:27,058][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:10:28,612][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:10:28,614][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:10:28,615][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:10:30,127][__main__][INFO] - Iteration 459 took 52s (9.33% Gen, 87.79% Train). Generation: 4s, Training: 46s. Estimated remaining time: 7h 31m 36s. Estimated total time: 14h 36m 1s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 36s, 500 more iterations: 7h 18m 0s. [2026-03-25 21:10:30,129][__main__][INFO] - Starting iteration 459. [2026-03-25 21:10:30,133][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:10:30,134][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:10:36,076][__main__][INFO] - Number of regex retries in iteration 459: 0 [2026-03-25 21:10:36,078][__main__][INFO] - agents played in iteration 459 are Bob, Alice [2026-03-25 21:10:37,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:10:37,272][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:10:37,273][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:10:37,274][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:10:37,947][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:10:38,582][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:10:39,242][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:10:39,900][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:10:40,560][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:10:41,218][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:10:41,876][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:10:42,534][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:10:43,191][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:10:43,851][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:10:44,508][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:10:45,166][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:10:45,823][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:10:46,481][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:10:47,138][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:10:47,796][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:10:48,453][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:10:49,113][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:10:49,772][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:10:50,431][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:10:51,089][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:10:51,746][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:10:52,403][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:10:53,061][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:10:53,719][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:10:54,383][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:10:55,041][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:10:55,699][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:10:56,356][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:10:57,016][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:10:57,674][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:10:58,332][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:10:58,990][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:10:59,649][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:11:00,310][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:11:00,973][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:11:01,627][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:11:02,286][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:11:02,946][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:11:03,604][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:11:04,262][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:11:04,919][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:11:05,577][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:11:06,236][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:11:06,894][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:11:07,553][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:11:08,211][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:11:08,869][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:11:09,863][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:11:10,522][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:11:11,181][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:11:11,839][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:11:12,499][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:11:13,158][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:11:13,816][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:11:14,473][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:11:15,132][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:11:15,790][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:11:16,448][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:11:17,106][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:11:17,764][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:11:18,423][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:11:19,081][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:11:19,739][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:11:20,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:11:21,185][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:11:22,760][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:11:22,763][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:11:22,765][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:11:24,152][__main__][INFO] - Iteration 460 took 54s (11.00% Gen, 86.43% Train). Generation: 5s, Training: 46s. Estimated remaining time: 7h 55m 1s. Estimated total time: 15h 0m 20s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 2s, 500 more iterations: 7h 30m 10s. [2026-03-25 21:11:24,154][__main__][INFO] - Starting iteration 460. [2026-03-25 21:11:24,158][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:11:24,159][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:11:28,970][__main__][INFO] - Number of regex retries in iteration 460: 0 [2026-03-25 21:11:28,971][__main__][INFO] - agents played in iteration 460 are Bob, Alice [2026-03-25 21:11:29,468][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:11:29,529][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:11:29,530][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:11:29,531][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:11:30,386][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:11:31,010][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:11:31,672][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:11:32,329][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:11:32,990][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:11:33,646][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:11:34,304][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:11:34,961][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:11:35,618][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:11:36,276][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:11:36,935][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:11:37,593][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:11:38,251][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:11:38,908][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:11:39,566][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:11:40,226][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:11:40,887][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:11:41,545][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:11:42,206][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:11:42,865][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:11:43,522][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:11:44,179][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:11:44,837][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:11:45,494][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:11:46,152][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:11:46,811][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:11:47,470][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:11:48,128][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:11:48,787][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:11:49,445][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:11:50,102][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:11:50,760][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:11:51,417][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:11:52,075][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:11:52,734][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:11:53,391][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:11:54,049][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:11:54,708][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:11:55,367][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:11:56,026][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:11:56,684][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:11:57,342][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:11:58,000][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:11:58,660][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:11:59,317][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:11:59,975][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:12:00,633][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:12:01,291][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:12:02,282][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:12:02,940][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:12:03,598][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:12:04,255][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:12:04,915][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:12:05,573][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:12:06,233][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:12:06,891][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:12:07,549][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:12:08,206][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:12:08,864][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:12:09,523][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:12:10,181][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:12:10,840][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:12:11,500][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:12:12,158][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:12:12,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:12:13,602][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:12:15,074][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:12:15,077][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:12:15,078][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:12:16,465][__main__][INFO] - Iteration 461 took 52s (9.20% Gen, 88.14% Train). Generation: 4s, Training: 46s. Estimated remaining time: 7h 25m 37s. Estimated total time: 14h 31m 48s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 10s, 500 more iterations: 7h 15m 54s. [2026-03-25 21:12:16,467][__main__][INFO] - Starting iteration 461. [2026-03-25 21:12:16,470][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:12:16,471][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:12:21,179][__main__][INFO] - Number of regex retries in iteration 461: 0 [2026-03-25 21:12:21,180][__main__][INFO] - agents played in iteration 461 are Bob, Alice [2026-03-25 21:12:21,661][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:12:21,724][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:12:21,724][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:12:21,725][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:12:22,508][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:12:23,126][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:12:23,786][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:12:24,445][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:12:25,104][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:12:25,763][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:12:26,427][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:12:27,085][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:12:27,745][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:12:28,404][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:12:29,062][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:12:29,722][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:12:30,381][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:12:31,041][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:12:31,701][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:12:32,360][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:12:33,021][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:12:33,680][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:12:34,341][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:12:35,000][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:12:35,659][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:12:36,318][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:12:36,976][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:12:37,637][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:12:38,298][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:12:38,958][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:12:39,618][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:12:40,280][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:12:40,939][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:12:41,600][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:12:42,260][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:12:42,918][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:12:43,577][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:12:44,236][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:12:44,895][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:12:45,553][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:12:46,214][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:12:46,874][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:12:47,535][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:12:48,195][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:12:48,855][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:12:49,514][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:12:50,173][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:12:50,832][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:12:51,491][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:12:52,149][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:12:52,808][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:12:53,467][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:12:54,457][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:12:55,116][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:12:55,774][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:12:56,436][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:12:57,095][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:12:57,751][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:12:58,409][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:12:59,066][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:12:59,723][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:13:00,382][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:13:01,043][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:13:01,702][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:13:02,364][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:13:03,024][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:13:03,682][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:13:04,341][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:13:05,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:13:05,935][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:13:07,344][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:13:07,347][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:13:07,348][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:13:08,840][__main__][INFO] - Iteration 462 took 52s (8.99% Gen, 88.16% Train). Generation: 4s, Training: 46s. Estimated remaining time: 7h 25m 47s. Estimated total time: 14h 32m 51s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 17s, 500 more iterations: 7h 16m 25s. [2026-03-25 21:13:08,843][__main__][INFO] - Starting iteration 462. [2026-03-25 21:13:08,846][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:13:08,847][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:13:14,359][__main__][INFO] - Number of regex retries in iteration 462: 0 [2026-03-25 21:13:14,361][__main__][INFO] - agents played in iteration 462 are Bob, Alice [2026-03-25 21:13:15,352][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:13:15,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:13:15,414][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:13:15,415][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:13:16,086][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:13:16,697][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:13:17,355][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:13:18,015][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:13:18,674][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:13:19,333][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:13:19,993][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:13:20,652][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:13:21,310][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:13:21,969][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:13:22,627][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:13:23,285][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:13:23,942][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:13:24,601][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:13:25,259][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:13:25,917][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:13:26,575][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:13:27,233][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:13:27,892][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:13:28,550][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:13:29,208][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:13:29,867][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:13:30,525][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:13:31,183][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:13:31,842][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:13:32,500][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:13:33,159][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:13:33,818][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:13:34,476][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:13:35,133][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:13:35,790][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:13:36,449][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:13:37,107][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:13:37,766][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:13:38,424][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:13:39,082][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:13:39,740][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:13:40,398][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:13:41,056][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:13:41,714][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:13:42,371][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:13:43,029][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:13:43,687][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:13:44,346][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:13:45,007][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:13:45,667][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:13:46,325][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:13:46,983][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:13:47,976][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:13:48,634][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:13:49,291][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:13:49,949][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:13:50,608][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:13:51,266][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:13:51,925][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:13:52,585][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:13:53,243][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:13:53,901][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:13:54,559][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:13:55,218][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:13:55,876][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:13:56,535][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:13:57,194][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:13:57,853][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:13:58,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:13:59,290][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:14:01,377][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:14:01,380][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:14:01,381][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:14:02,707][__main__][INFO] - Iteration 463 took 53s (10.24% Gen, 87.30% Train). Generation: 5s, Training: 47s. Estimated remaining time: 7h 49m 44s. Estimated total time: 14h 57m 42s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 46s, 500 more iterations: 7h 28m 51s. [2026-03-25 21:14:02,709][__main__][INFO] - Starting iteration 463. [2026-03-25 21:14:02,713][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:14:02,713][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:14:09,219][__main__][INFO] - Number of regex retries in iteration 463: 0 [2026-03-25 21:14:09,220][__main__][INFO] - agents played in iteration 463 are Bob, Alice [2026-03-25 21:14:09,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:14:09,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:14:09,801][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:14:09,801][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:14:10,519][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:14:11,145][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:14:11,796][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:14:12,459][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:14:13,116][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:14:13,774][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:14:14,431][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:14:15,090][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:14:15,748][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:14:16,406][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:14:17,064][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:14:17,724][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:14:18,382][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:14:19,040][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:14:19,698][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:14:20,356][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:14:21,014][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:14:21,673][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:14:22,330][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:14:22,987][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:14:23,644][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:14:24,304][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:14:24,962][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:14:25,619][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:14:26,277][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:14:26,934][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:14:27,591][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:14:28,250][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:14:28,908][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:14:29,566][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:14:30,224][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:14:30,882][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:14:31,540][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:14:32,197][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:14:32,856][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:14:33,513][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:14:34,171][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:14:34,830][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:14:35,489][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:14:36,146][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:14:36,804][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:14:37,461][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:14:38,119][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:14:38,776][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:14:39,435][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:14:40,093][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:14:40,753][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:14:41,412][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:14:42,399][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:14:43,058][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:14:43,717][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:14:44,375][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:14:45,033][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:14:45,691][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:14:46,351][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:14:47,010][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:14:47,670][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:14:48,329][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:14:48,988][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:14:49,646][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:14:50,305][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:14:50,964][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:14:51,622][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:14:52,280][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:14:52,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:14:53,711][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:14:55,407][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:14:55,410][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:14:55,411][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:14:56,844][__main__][INFO] - Iteration 464 took 54s (12.02% Gen, 85.33% Train). Generation: 6s, Training: 46s. Estimated remaining time: 7h 53m 21s. Estimated total time: 15h 2m 12s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 13s, 500 more iterations: 7h 31m 6s. [2026-03-25 21:14:56,846][__main__][INFO] - Starting iteration 464. [2026-03-25 21:14:56,850][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:14:56,850][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:15:02,026][__main__][INFO] - Number of regex retries in iteration 464: 0 [2026-03-25 21:15:02,027][__main__][INFO] - agents played in iteration 464 are Bob, Alice [2026-03-25 21:15:02,554][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:15:02,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:15:02,615][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:15:02,616][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:15:03,437][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:15:04,071][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:15:04,723][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:15:05,379][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:15:06,038][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:15:06,697][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:15:07,353][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:15:08,016][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:15:08,677][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:15:09,334][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:15:09,995][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:15:10,654][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:15:11,315][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:15:11,973][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:15:12,631][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:15:13,291][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:15:13,949][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:15:14,607][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:15:15,264][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:15:15,923][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:15:16,580][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:15:17,237][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:15:17,895][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:15:18,553][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:15:19,211][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:15:19,869][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:15:20,528][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:15:21,188][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:15:21,846][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:15:22,505][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:15:23,165][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:15:23,824][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:15:24,482][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:15:25,141][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:15:25,800][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:15:26,458][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:15:27,116][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:15:27,774][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:15:28,432][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:15:29,090][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:15:29,748][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:15:30,405][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:15:31,063][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:15:31,721][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:15:32,379][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:15:33,036][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:15:33,695][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:15:34,353][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:15:35,344][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:15:36,003][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:15:36,661][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:15:37,319][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:15:37,977][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:15:38,636][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:15:39,293][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:15:39,952][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:15:40,610][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:15:41,270][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:15:41,928][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:15:42,589][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:15:43,249][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:15:43,909][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:15:44,568][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:15:45,227][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:15:45,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:15:46,651][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:15:48,152][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:15:48,155][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:15:48,156][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:15:49,526][__main__][INFO] - Iteration 465 took 52s (9.83% Gen, 87.57% Train). Generation: 5s, Training: 46s. Estimated remaining time: 7h 28m 13s. Estimated total time: 14h 37m 58s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 47s, 500 more iterations: 7h 18m 59s. [2026-03-25 21:15:49,528][__main__][INFO] - Starting iteration 465. [2026-03-25 21:15:49,532][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:15:49,533][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:15:55,368][__main__][INFO] - Number of regex retries in iteration 465: 0 [2026-03-25 21:15:55,369][__main__][INFO] - agents played in iteration 465 are Bob, Alice [2026-03-25 21:15:55,975][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:15:56,037][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:15:56,037][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:15:56,038][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:15:56,711][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:15:57,315][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:15:57,975][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:15:58,634][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:15:59,293][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:15:59,951][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:16:00,611][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:16:01,269][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:16:01,927][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:16:02,585][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:16:03,243][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:16:03,902][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:16:04,561][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:16:05,219][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:16:05,877][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:16:06,536][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:16:07,194][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:16:07,852][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:16:08,510][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:16:09,168][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:16:09,826][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:16:10,483][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:16:11,141][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:16:11,799][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:16:12,458][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:16:13,115][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:16:13,773][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:16:14,430][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:16:15,089][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:16:15,747][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:16:16,405][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:16:17,062][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:16:17,720][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:16:18,377][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:16:19,034][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:16:19,693][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:16:20,351][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:16:21,008][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:16:21,667][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:16:22,324][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:16:22,982][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:16:23,641][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:16:24,298][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:16:24,956][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:16:25,614][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:16:26,273][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:16:26,930][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:16:27,589][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:16:28,577][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:16:30,077][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:16:30,735][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:16:31,393][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:16:32,051][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:16:32,708][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:16:33,366][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:16:34,024][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:16:34,684][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:16:35,342][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:16:36,000][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:16:36,658][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:16:37,316][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:16:37,975][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:16:38,633][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:16:39,291][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:16:39,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:16:40,875][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 21:16:42,446][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:16:42,448][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:16:42,450][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:16:43,945][__main__][INFO] - Iteration 466 took 54s (10.72% Gen, 86.52% Train). Generation: 5s, Training: 47s. Estimated remaining time: 7h 56m 16s. Estimated total time: 15h 6m 55s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 41s, 500 more iterations: 7h 33m 27s. [2026-03-25 21:16:43,948][__main__][INFO] - Starting iteration 466. [2026-03-25 21:16:43,952][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:16:43,953][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:16:51,311][__main__][INFO] - Number of regex retries in iteration 466: 0 [2026-03-25 21:16:51,312][__main__][INFO] - agents played in iteration 466 are Bob, Alice [2026-03-25 21:16:52,181][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:16:52,242][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:16:52,243][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:16:52,243][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:16:53,124][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:16:53,751][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:16:54,409][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:16:55,070][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:16:55,729][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:16:56,388][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:16:57,047][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:16:57,707][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:16:58,366][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:16:59,025][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:16:59,683][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:17:00,341][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:17:01,001][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:17:01,659][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:17:02,318][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:17:02,977][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:17:03,635][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:17:04,295][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:17:04,954][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:17:05,619][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:17:06,278][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:17:06,935][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:17:07,594][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:17:08,251][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:17:08,909][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:17:09,566][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:17:10,224][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:17:10,882][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:17:11,541][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:17:12,199][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:17:12,857][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:17:13,516][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:17:14,174][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:17:14,832][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:17:15,490][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:17:16,147][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:17:16,805][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:17:17,463][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:17:18,120][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:17:18,778][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:17:19,436][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:17:20,094][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:17:20,752][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:17:21,411][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:17:22,069][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:17:22,727][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:17:23,385][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:17:24,044][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:17:25,026][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:17:25,685][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:17:26,342][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:17:27,002][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:17:27,659][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:17:28,320][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:17:28,978][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:17:29,637][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:17:30,295][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:17:30,953][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:17:31,611][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:17:32,269][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:17:32,927][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:17:33,586][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:17:34,245][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:17:34,904][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:17:35,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:17:36,265][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:17:37,793][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:17:37,795][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:17:37,796][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:17:39,331][__main__][INFO] - Iteration 467 took 55s (13.29% Gen, 83.94% Train). Generation: 7s, Training: 46s. Estimated remaining time: 8h 11m 26s. Estimated total time: 15h 23m 0s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 18s, 500 more iterations: 7h 41m 30s. [2026-03-25 21:17:39,333][__main__][INFO] - Starting iteration 467. [2026-03-25 21:17:39,338][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:17:39,338][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:17:47,324][__main__][INFO] - Number of regex retries in iteration 467: 0 [2026-03-25 21:17:47,325][__main__][INFO] - agents played in iteration 467 are Bob, Alice [2026-03-25 21:17:48,388][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:17:48,449][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:17:48,450][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:17:48,450][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:17:49,312][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:17:49,928][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:17:50,588][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:17:51,247][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:17:51,907][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:17:52,568][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:17:53,226][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:17:53,884][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:17:54,542][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:17:55,203][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:17:55,862][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:17:56,520][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:17:57,178][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:17:57,836][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:17:58,494][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:17:59,152][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:17:59,809][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:18:00,467][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:18:01,126][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:18:01,784][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:18:02,442][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:18:03,101][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:18:03,759][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:18:04,418][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:18:05,076][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:18:05,734][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:18:06,392][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:18:07,050][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:18:07,708][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:18:08,367][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:18:09,027][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:18:09,685][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:18:10,343][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:18:11,002][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:18:11,660][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:18:12,318][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:18:12,976][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:18:13,634][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:18:14,292][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:18:14,950][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:18:15,608][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:18:16,265][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:18:16,923][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:18:17,582][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:18:18,240][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:18:18,898][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:18:19,557][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:18:20,216][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:18:21,254][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:18:21,913][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:18:22,573][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:18:23,231][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:18:23,891][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:18:24,551][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:18:25,211][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:18:25,869][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:18:26,527][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:18:27,186][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:18:27,844][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:18:28,503][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:18:29,161][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:18:29,820][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:18:30,479][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:18:31,136][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:18:31,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:18:32,596][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:18:33,971][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:18:33,974][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:18:33,975][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:18:35,423][__main__][INFO] - Iteration 468 took 56s (14.24% Gen, 83.17% Train). Generation: 7s, Training: 46s. Estimated remaining time: 8h 22m 17s. Estimated total time: 15h 34m 47s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 28s, 500 more iterations: 7h 47m 23s. [2026-03-25 21:18:35,426][__main__][INFO] - Starting iteration 468. [2026-03-25 21:18:35,431][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:18:35,431][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:18:40,438][__main__][INFO] - Number of regex retries in iteration 468: 0 [2026-03-25 21:18:40,439][__main__][INFO] - agents played in iteration 468 are Bob, Alice [2026-03-25 21:18:40,935][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:18:40,997][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:18:40,998][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:18:40,998][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:18:41,838][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:18:42,453][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:18:43,114][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:18:43,772][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:18:44,430][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:18:45,087][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:18:45,746][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:18:46,404][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:18:47,063][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:18:47,722][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:18:48,381][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:18:49,039][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:18:49,698][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:18:50,355][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:18:51,013][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:18:51,672][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:18:52,331][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:18:52,989][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:18:53,650][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:18:54,310][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:18:54,969][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:18:55,627][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:18:56,285][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:18:56,942][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:18:57,600][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:18:58,258][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:18:58,918][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:18:59,575][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:19:00,234][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:19:00,892][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:19:01,552][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:19:02,210][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:19:02,868][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:19:03,526][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:19:04,183][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:19:04,842][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:19:05,499][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:19:06,158][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:19:06,817][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:19:07,476][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:19:08,135][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:19:08,794][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:19:09,453][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:19:10,113][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:19:10,773][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:19:11,432][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:19:12,091][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:19:12,752][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:19:13,734][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:19:14,395][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:19:15,055][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:19:15,714][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:19:16,382][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:19:17,041][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:19:17,700][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:19:18,360][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:19:19,019][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:19:19,678][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:19:20,338][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:19:20,996][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:19:21,657][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:19:22,316][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:19:22,975][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:19:23,637][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:19:24,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:19:25,182][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:19:26,610][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:19:26,613][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:19:26,614][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:19:28,148][__main__][INFO] - Iteration 469 took 52s (9.50% Gen, 87.58% Train). Generation: 5s, Training: 46s. Estimated remaining time: 7h 25m 17s. Estimated total time: 14h 38m 40s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 52s, 500 more iterations: 7h 19m 20s. [2026-03-25 21:19:28,151][__main__][INFO] - Starting iteration 469. [2026-03-25 21:19:28,155][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:19:28,155][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:19:34,543][__main__][INFO] - Number of regex retries in iteration 469: 0 [2026-03-25 21:19:34,543][__main__][INFO] - agents played in iteration 469 are Bob, Alice [2026-03-25 21:19:35,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:19:35,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:19:35,095][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:19:35,095][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:19:35,768][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:19:36,379][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:19:37,041][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:19:37,703][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:19:38,362][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:19:39,020][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:19:39,678][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:19:40,337][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:19:40,995][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:19:41,654][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:19:42,311][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:19:42,971][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:19:43,629][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:19:44,289][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:19:44,947][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:19:45,607][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:19:46,265][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:19:46,923][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:19:47,581][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:19:48,240][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:19:48,899][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:19:49,557][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:19:50,216][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:19:50,876][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:19:51,535][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:19:52,193][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:19:52,852][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:19:53,511][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:19:54,169][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:19:54,828][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:19:55,487][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:19:56,144][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:19:56,804][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:19:57,463][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:19:58,131][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:19:58,780][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:19:59,439][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:20:00,097][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:20:00,755][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:20:01,413][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:20:02,071][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:20:02,730][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:20:03,389][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:20:04,047][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:20:04,705][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:20:05,362][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:20:06,020][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:20:06,679][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:20:07,727][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:20:08,386][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:20:09,043][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:20:09,702][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:20:10,361][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:20:11,019][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:20:11,677][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:20:12,336][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:20:12,994][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:20:13,651][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:20:14,310][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:20:14,968][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:20:15,625][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:20:16,283][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:20:16,942][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:20:17,601][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:20:18,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:20:19,093][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:20:20,415][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:20:20,417][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:20:20,418][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:20:21,909][__main__][INFO] - Iteration 470 took 53s (11.88% Gen, 85.34% Train). Generation: 6s, Training: 45s. Estimated remaining time: 7h 41m 39s. Estimated total time: 14h 55m 56s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 35s, 500 more iterations: 7h 27m 58s. [2026-03-25 21:20:21,911][__main__][INFO] - Starting iteration 470. [2026-03-25 21:20:21,915][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:20:21,915][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:20:27,355][__main__][INFO] - Number of regex retries in iteration 470: 0 [2026-03-25 21:20:27,356][__main__][INFO] - agents played in iteration 470 are Bob, Alice [2026-03-25 21:20:27,927][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:20:27,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:20:27,989][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:20:27,989][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:20:28,805][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:20:29,437][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:20:30,081][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:20:30,740][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:20:31,400][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:20:32,060][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:20:32,718][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:20:33,378][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:20:34,037][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:20:34,695][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:20:35,353][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:20:36,011][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:20:36,669][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:20:37,326][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:20:37,986][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:20:38,643][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:20:39,301][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:20:39,960][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:20:40,619][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:20:41,277][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:20:41,935][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:20:42,593][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:20:43,252][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:20:43,910][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:20:44,568][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:20:45,226][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:20:45,884][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:20:46,541][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:20:47,200][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:20:47,858][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:20:48,515][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:20:49,173][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:20:49,831][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:20:50,490][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:20:51,149][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:20:51,807][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:20:52,464][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:20:53,122][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:20:53,780][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:20:54,438][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:20:55,096][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:20:55,754][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:20:56,413][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:20:57,071][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:20:57,730][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:20:58,389][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:20:59,047][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:20:59,705][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:21:00,734][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:21:01,392][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:21:02,050][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:21:02,709][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:21:03,366][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:21:04,026][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:21:04,684][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:21:05,343][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:21:06,002][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:21:06,660][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:21:07,318][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:21:07,975][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:21:08,634][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:21:09,293][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:21:09,952][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:21:10,611][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:21:11,270][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:21:12,067][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:21:13,528][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:21:13,531][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:21:13,533][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:21:14,979][__main__][INFO] - Iteration 471 took 53s (10.25% Gen, 87.02% Train). Generation: 5s, Training: 46s. Estimated remaining time: 7h 29m 16s. Estimated total time: 14h 44m 26s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 26s, 500 more iterations: 7h 22m 13s. [2026-03-25 21:21:14,981][__main__][INFO] - Starting iteration 471. [2026-03-25 21:21:14,989][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:21:14,990][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:21:20,577][__main__][INFO] - Number of regex retries in iteration 471: 0 [2026-03-25 21:21:20,578][__main__][INFO] - agents played in iteration 471 are Bob, Alice [2026-03-25 21:21:21,167][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:21:21,228][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:21:21,229][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:21:21,229][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:21:22,130][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:21:22,744][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:21:23,404][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:21:24,064][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:21:24,723][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:21:25,383][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:21:26,042][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:21:26,701][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:21:27,360][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:21:28,019][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:21:28,679][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:21:29,339][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:21:29,997][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:21:30,656][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:21:31,315][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:21:31,975][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:21:32,634][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:21:33,293][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:21:33,951][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:21:34,612][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:21:35,272][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:21:35,932][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:21:36,599][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:21:37,257][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:21:37,917][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:21:38,576][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:21:39,236][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:21:39,895][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:21:40,554][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:21:41,214][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:21:41,874][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:21:42,535][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:21:43,194][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:21:43,853][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:21:44,513][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:21:45,172][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:21:45,832][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:21:46,490][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:21:47,149][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:21:47,808][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:21:48,466][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:21:49,126][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:21:49,785][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:21:50,443][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:21:51,101][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:21:51,760][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:21:52,419][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:21:53,078][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:21:54,073][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:21:55,157][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:21:55,815][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:21:56,474][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:21:57,133][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:21:57,791][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:21:58,450][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:21:59,108][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:21:59,767][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:22:00,425][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:22:01,083][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:22:01,741][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:22:02,398][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:22:03,056][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:22:03,714][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:22:04,372][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:22:05,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:22:05,818][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:22:07,156][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:22:07,159][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:22:07,160][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:22:08,665][__main__][INFO] - Iteration 472 took 53s (10.41% Gen, 86.78% Train). Generation: 5s, Training: 46s. Estimated remaining time: 7h 38m 33s. Estimated total time: 14h 54m 37s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 27s, 500 more iterations: 7h 27m 18s. [2026-03-25 21:22:08,667][__main__][INFO] - Starting iteration 472. [2026-03-25 21:22:08,670][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:22:08,670][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:22:13,545][__main__][INFO] - Number of regex retries in iteration 472: 0 [2026-03-25 21:22:13,546][__main__][INFO] - agents played in iteration 472 are Bob, Alice [2026-03-25 21:22:14,050][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:22:14,111][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:22:14,111][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:22:14,112][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:22:14,846][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:22:15,482][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:22:16,143][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:22:16,802][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:22:17,461][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:22:18,121][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:22:18,780][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:22:19,439][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:22:20,098][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:22:20,756][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:22:21,415][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:22:22,073][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:22:22,732][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:22:23,390][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:22:24,049][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:22:24,709][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:22:25,368][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:22:26,027][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:22:26,686][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:22:27,344][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:22:28,002][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:22:28,661][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:22:29,321][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:22:29,980][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:22:30,639][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:22:31,298][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:22:31,956][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:22:32,614][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:22:33,273][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:22:33,932][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:22:34,591][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:22:35,249][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:22:35,907][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:22:36,567][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:22:37,225][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:22:37,884][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:22:38,542][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:22:39,201][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:22:39,860][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:22:40,517][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:22:41,176][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:22:41,836][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:22:42,494][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:22:43,153][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:22:43,811][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:22:44,470][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:22:45,130][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:22:45,789][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:22:46,777][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:22:47,436][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:22:48,096][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:22:48,755][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:22:49,414][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:22:50,072][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:22:50,730][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:22:51,390][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:22:52,048][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:22:52,707][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:22:53,365][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:22:54,023][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:22:54,682][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:22:55,340][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:22:56,002][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:22:56,661][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:22:57,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:22:58,079][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:22:59,413][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:22:59,416][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:22:59,447][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:23:00,946][__main__][INFO] - Iteration 473 took 52s (9.33% Gen, 87.80% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 14m 22s. Estimated total time: 14h 31m 17s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 7s, 500 more iterations: 7h 15m 38s. [2026-03-25 21:23:00,948][__main__][INFO] - Starting iteration 473. [2026-03-25 21:23:00,952][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:23:00,953][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:23:06,194][__main__][INFO] - Number of regex retries in iteration 473: 0 [2026-03-25 21:23:06,195][__main__][INFO] - agents played in iteration 473 are Bob, Alice [2026-03-25 21:23:06,967][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:23:07,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:23:07,031][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:23:07,031][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:23:07,763][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:23:08,378][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:23:09,038][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:23:09,696][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:23:10,355][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:23:11,015][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:23:11,674][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:23:12,334][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:23:12,993][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:23:13,651][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:23:14,310][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:23:14,968][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:23:15,627][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:23:16,285][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:23:16,943][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:23:17,601][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:23:18,259][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:23:18,916][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:23:19,575][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:23:20,234][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:23:20,892][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:23:21,550][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:23:22,208][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:23:22,866][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:23:23,525][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:23:24,184][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:23:24,842][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:23:25,500][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:23:26,157][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:23:26,814][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:23:27,474][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:23:28,133][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:23:28,791][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:23:29,449][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:23:30,107][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:23:30,766][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:23:31,425][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:23:32,083][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:23:32,741][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:23:33,400][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:23:34,058][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:23:34,716][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:23:35,374][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:23:36,031][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:23:36,691][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:23:37,349][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:23:38,008][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:23:38,666][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:23:39,652][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:23:40,319][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:23:40,979][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:23:41,636][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:23:42,294][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:23:42,952][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:23:43,610][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:23:44,269][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:23:44,927][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:23:45,586][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:23:46,244][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:23:46,902][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:23:47,559][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:23:48,217][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:23:48,877][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:23:49,535][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:23:50,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:23:51,017][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:23:52,395][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:23:52,398][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:23:52,399][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:23:53,851][__main__][INFO] - Iteration 474 took 52s (9.91% Gen, 87.34% Train). Generation: 5s, Training: 46s. Estimated remaining time: 7h 23m 52s. Estimated total time: 14h 41m 40s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 10s, 500 more iterations: 7h 20m 50s. [2026-03-25 21:23:53,853][__main__][INFO] - Starting iteration 474. [2026-03-25 21:23:53,857][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:23:53,858][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:23:59,946][__main__][INFO] - Number of regex retries in iteration 474: 0 [2026-03-25 21:23:59,948][__main__][INFO] - agents played in iteration 474 are Bob, Alice [2026-03-25 21:24:00,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:24:00,641][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:24:00,642][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:24:00,642][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:24:01,284][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:24:01,906][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:24:02,565][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:24:03,224][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:24:03,881][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:24:04,541][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:24:05,199][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:24:05,857][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:24:06,516][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:24:07,174][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:24:07,832][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:24:08,490][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:24:09,148][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:24:09,805][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:24:10,464][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:24:11,121][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:24:11,780][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:24:12,438][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:24:13,097][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:24:13,755][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:24:14,413][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:24:15,071][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:24:15,729][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:24:16,387][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:24:17,044][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:24:17,702][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:24:18,359][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:24:19,018][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:24:19,675][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:24:20,334][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:24:20,991][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:24:21,651][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:24:22,309][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:24:22,966][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:24:23,624][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:24:24,281][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:24:24,939][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:24:25,597][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:24:26,257][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:24:26,915][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:24:27,572][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:24:28,231][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:24:28,889][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:24:29,549][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:24:30,207][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:24:30,865][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:24:31,524][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:24:32,182][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:24:33,164][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:24:33,824][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:24:34,484][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:24:35,145][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:24:35,803][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:24:36,462][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:24:37,121][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:24:37,780][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:24:38,439][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:24:39,097][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:24:39,755][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:24:40,414][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:24:41,073][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:24:41,732][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:24:42,391][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:24:43,049][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:24:43,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:24:44,487][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:24:45,840][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:24:45,843][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:24:45,844][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:24:47,300][__main__][INFO] - Iteration 475 took 53s (11.40% Gen, 85.88% Train). Generation: 6s, Training: 45s. Estimated remaining time: 7h 32m 2s. Estimated total time: 14h 50m 43s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 4s, 500 more iterations: 7h 25m 21s. [2026-03-25 21:24:47,302][__main__][INFO] - Starting iteration 475. [2026-03-25 21:24:47,305][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:24:47,305][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:24:53,840][__main__][INFO] - Number of regex retries in iteration 475: 0 [2026-03-25 21:24:53,841][__main__][INFO] - agents played in iteration 475 are Bob, Alice [2026-03-25 21:24:54,436][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:24:54,500][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:24:54,501][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:24:54,502][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:24:55,256][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:24:55,873][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:24:56,532][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:24:57,190][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:24:57,852][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:24:58,511][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:24:59,170][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:24:59,829][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:25:00,489][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:25:01,148][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:25:01,807][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:25:02,466][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:25:03,124][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:25:03,782][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:25:04,440][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:25:05,100][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:25:05,758][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:25:06,416][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:25:07,074][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:25:07,732][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:25:08,389][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:25:09,047][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:25:09,706][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:25:10,366][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:25:11,025][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:25:11,684][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:25:12,342][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:25:13,000][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:25:13,659][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:25:14,317][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:25:14,975][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:25:15,634][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:25:16,293][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:25:16,952][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:25:17,610][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:25:18,268][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:25:18,927][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:25:19,585][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:25:20,243][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:25:20,901][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:25:21,560][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:25:22,219][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:25:22,878][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:25:23,874][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:25:24,533][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:25:25,193][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:25:25,853][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:25:26,512][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:25:27,170][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:25:27,828][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:25:28,487][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:25:29,145][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:25:29,803][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:25:30,461][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:25:31,121][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:25:31,780][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:25:32,438][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:25:33,096][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:25:33,754][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:25:34,412][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:25:35,070][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:25:35,728][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:25:36,386][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:25:37,044][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:25:37,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:25:38,525][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:25:40,292][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:25:40,294][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:25:40,295][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:25:41,753][__main__][INFO] - Iteration 476 took 54s (12.00% Gen, 85.32% Train). Generation: 6s, Training: 46s. Estimated remaining time: 7h 47m 53s. Estimated total time: 15h 7m 29s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 44s, 500 more iterations: 7h 33m 44s. [2026-03-25 21:25:41,755][__main__][INFO] - Starting iteration 476. [2026-03-25 21:25:41,759][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:25:41,760][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:25:46,838][__main__][INFO] - Number of regex retries in iteration 476: 0 [2026-03-25 21:25:46,840][__main__][INFO] - agents played in iteration 476 are Bob, Alice [2026-03-25 21:25:47,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:25:47,502][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:25:47,503][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:25:47,503][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:25:48,278][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:25:48,884][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:25:49,546][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:25:50,205][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:25:50,863][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:25:51,521][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:25:52,179][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:25:52,839][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:25:53,497][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:25:54,156][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:25:54,814][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:25:55,472][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:25:56,138][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:25:56,796][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:25:57,455][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:25:58,114][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:25:58,774][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:25:59,433][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:26:00,092][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:26:00,751][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:26:01,410][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:26:02,068][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:26:02,726][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:26:03,385][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:26:04,042][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:26:04,701][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:26:05,360][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:26:06,018][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:26:06,675][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:26:07,332][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:26:07,990][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:26:08,648][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:26:09,305][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:26:09,963][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:26:10,622][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:26:11,280][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:26:11,938][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:26:12,597][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:26:13,255][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:26:13,913][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:26:14,571][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:26:15,229][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:26:15,887][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:26:16,544][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:26:17,202][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:26:17,860][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:26:18,519][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:26:19,177][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:26:20,205][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:26:20,865][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:26:21,523][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:26:22,181][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:26:22,839][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:26:23,497][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:26:24,157][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:26:24,816][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:26:25,474][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:26:26,132][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:26:26,791][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:26:27,448][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:26:28,107][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:26:28,766][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:26:29,424][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:26:30,084][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:26:30,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:26:31,558][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:26:32,955][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:26:32,957][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:26:32,959][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:26:34,403][__main__][INFO] - Iteration 477 took 52s (9.65% Gen, 87.60% Train). Generation: 5s, Training: 46s. Estimated remaining time: 7h 16m 57s. Estimated total time: 14h 37m 26s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 44s, 500 more iterations: 7h 18m 43s. [2026-03-25 21:26:34,405][__main__][INFO] - Starting iteration 477. [2026-03-25 21:26:34,410][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:26:34,410][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:26:39,695][__main__][INFO] - Number of regex retries in iteration 477: 0 [2026-03-25 21:26:39,696][__main__][INFO] - agents played in iteration 477 are Bob, Alice [2026-03-25 21:26:40,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:26:40,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:26:40,379][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:26:40,380][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:26:41,145][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:26:41,753][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:26:42,413][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:26:43,074][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:26:43,733][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:26:44,391][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:26:45,051][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:26:45,712][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:26:46,370][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:26:47,029][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:26:47,687][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:26:48,347][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:26:49,005][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:26:49,665][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:26:50,323][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:26:50,981][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:26:51,639][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:26:52,297][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:26:52,956][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:26:53,614][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:26:54,274][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:26:54,932][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:26:55,590][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:26:56,248][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:26:56,906][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:26:57,564][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:26:58,222][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:26:58,881][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:26:59,541][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:27:00,200][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:27:00,858][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:27:01,516][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:27:02,174][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:27:02,832][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:27:03,490][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:27:04,149][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:27:04,808][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:27:05,466][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:27:06,124][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:27:06,782][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:27:07,440][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:27:08,098][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:27:08,757][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:27:09,415][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:27:10,073][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:27:10,734][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:27:11,393][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:27:12,052][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:27:13,049][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:27:13,708][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:27:14,368][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:27:15,026][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:27:15,684][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:27:16,343][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:27:17,000][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:27:17,661][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:27:18,319][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:27:18,979][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:27:19,637][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:27:20,296][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:27:20,955][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:27:21,615][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:27:22,275][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:27:22,933][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:27:23,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:27:24,456][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:27:25,487][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:27:25,490][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:27:25,491][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:27:27,082][__main__][INFO] - Iteration 478 took 52s (10.03% Gen, 86.94% Train). Generation: 5s, Training: 45s. Estimated remaining time: 7h 16m 33s. Estimated total time: 14h 37m 55s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 47s, 500 more iterations: 7h 18m 57s. [2026-03-25 21:27:27,084][__main__][INFO] - Starting iteration 478. [2026-03-25 21:27:27,092][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:27:27,093][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:27:33,658][__main__][INFO] - Number of regex retries in iteration 478: 0 [2026-03-25 21:27:33,658][__main__][INFO] - agents played in iteration 478 are Bob, Alice [2026-03-25 21:27:34,866][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:27:34,928][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:27:34,929][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:27:34,930][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:27:35,742][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:27:36,349][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:27:37,007][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:27:37,665][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:27:38,323][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:27:38,982][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:27:39,643][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:27:40,303][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:27:40,960][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:27:41,619][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:27:42,278][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:27:42,940][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:27:43,598][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:27:44,928][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:27:45,587][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:27:46,246][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:27:46,906][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:27:47,564][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:27:48,222][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:27:48,881][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:27:49,540][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:27:50,199][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:27:50,858][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:27:51,517][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:27:52,177][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:27:52,836][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:27:53,495][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:27:54,155][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:27:54,814][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:27:55,478][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:27:56,136][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:27:56,795][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:27:57,452][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:27:58,110][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:27:58,768][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:27:59,427][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:28:00,085][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:28:00,743][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:28:01,401][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:28:02,059][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:28:02,717][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:28:03,375][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:28:04,033][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:28:04,693][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:28:05,352][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:28:06,011][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:28:06,669][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:28:07,326][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:28:08,346][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:28:09,005][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:28:09,664][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:28:10,322][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:28:10,982][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:28:11,642][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:28:12,301][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:28:12,959][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:28:13,617][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:28:14,275][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:28:14,934][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:28:15,592][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:28:16,250][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:28:16,908][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:28:17,566][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:28:18,225][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:28:18,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:28:19,722][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:28:21,580][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:28:21,583][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:28:21,584][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:28:23,056][__main__][INFO] - Iteration 479 took 55s (11.73% Gen, 85.63% Train). Generation: 6s, Training: 47s. Estimated remaining time: 8h 10m 28s. Estimated total time: 15h 32m 45s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 16s, 500 more iterations: 7h 46m 22s. [2026-03-25 21:28:23,059][__main__][INFO] - Starting iteration 479. [2026-03-25 21:28:23,062][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:28:23,063][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:28:28,038][__main__][INFO] - Number of regex retries in iteration 479: 0 [2026-03-25 21:28:28,039][__main__][INFO] - agents played in iteration 479 are Bob, Alice [2026-03-25 21:28:28,661][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:28:28,722][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:28:28,722][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:28:28,723][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:28:29,560][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:28:30,179][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:28:30,838][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:28:31,496][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:28:32,154][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:28:32,812][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:28:33,471][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:28:34,128][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:28:34,787][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:28:35,446][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:28:36,105][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:28:36,763][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:28:37,422][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:28:38,080][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:28:38,740][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:28:39,399][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:28:40,059][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:28:40,717][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:28:41,376][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:28:42,035][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:28:42,694][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:28:43,353][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:28:44,013][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:28:44,671][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:28:45,332][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:28:45,994][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:28:46,653][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:28:47,311][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:28:47,972][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:28:48,630][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:28:49,289][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:28:49,948][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:28:50,607][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:28:51,265][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:28:51,923][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:28:52,582][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:28:53,243][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:28:53,902][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:28:54,560][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:28:55,218][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:28:55,875][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:28:56,535][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:28:57,195][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:28:57,853][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:28:58,512][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:28:59,170][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:28:59,831][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:29:00,490][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:29:01,481][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:29:02,142][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:29:02,800][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:29:03,460][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:29:04,119][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:29:04,778][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:29:05,438][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:29:06,097][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:29:06,755][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:29:07,415][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:29:08,073][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:29:08,731][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:29:09,389][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:29:10,048][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:29:10,707][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:29:11,366][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:29:12,026][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:29:12,802][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:29:14,723][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:29:14,729][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:29:14,730][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:29:16,205][__main__][INFO] - Iteration 480 took 53s (9.36% Gen, 87.86% Train). Generation: 4s, Training: 46s. Estimated remaining time: 7h 22m 33s. Estimated total time: 14h 45m 44s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 34s, 500 more iterations: 7h 22m 52s. [2026-03-25 21:29:16,207][__main__][INFO] - Starting iteration 480. [2026-03-25 21:29:16,211][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:29:16,211][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:29:22,867][__main__][INFO] - Number of regex retries in iteration 480: 0 [2026-03-25 21:29:22,868][__main__][INFO] - agents played in iteration 480 are Bob, Alice [2026-03-25 21:29:24,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:29:24,271][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:29:24,272][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:29:24,272][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:29:24,930][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:29:25,544][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:29:26,206][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:29:26,864][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:29:27,523][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:29:28,183][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:29:28,845][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:29:29,507][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:29:30,168][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:29:30,829][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:29:31,488][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:29:32,145][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:29:32,806][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:29:33,466][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:29:34,125][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:29:34,783][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:29:35,446][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:29:36,104][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:29:36,765][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:29:37,425][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:29:38,083][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:29:38,743][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:29:39,402][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:29:40,061][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:29:40,720][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:29:41,379][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:29:42,036][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:29:42,694][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:29:43,354][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:29:44,012][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:29:44,672][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:29:45,330][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:29:45,987][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:29:46,648][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:29:47,307][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:29:47,965][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:29:48,623][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:29:49,282][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:29:49,941][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:29:50,600][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:29:51,259][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:29:51,917][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:29:52,574][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:29:53,232][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:29:53,894][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:29:54,552][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:29:55,210][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:29:55,868][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:29:56,849][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:29:57,509][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:29:58,169][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:29:58,829][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:29:59,489][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:30:00,148][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:30:00,807][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:30:01,468][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:30:02,128][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:30:02,787][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:30:03,447][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:30:04,106][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:30:04,765][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:30:05,427][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:30:06,087][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:30:06,746][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:30:07,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:30:08,278][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:30:10,135][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:30:10,140][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:30:10,142][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:30:11,769][__main__][INFO] - Iteration 481 took 55s (11.98% Gen, 85.09% Train). Generation: 6s, Training: 47s. Estimated remaining time: 8h 1m 53s. Estimated total time: 15h 25m 59s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 35s, 500 more iterations: 7h 42m 59s. [2026-03-25 21:30:11,771][__main__][INFO] - Starting iteration 481. [2026-03-25 21:30:11,775][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:30:11,776][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:30:26,994][__main__][INFO] - Number of regex retries in iteration 481: 0 [2026-03-25 21:30:26,995][__main__][INFO] - agents played in iteration 481 are Bob, Alice [2026-03-25 21:30:28,202][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:30:28,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:30:28,265][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:30:28,265][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:30:28,918][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:30:29,526][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:30:30,186][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:30:30,845][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:30:31,508][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:30:32,165][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:30:32,824][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:30:33,483][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:30:34,142][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:30:34,800][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:30:35,459][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:30:36,118][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:30:36,776][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:30:37,435][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:30:38,093][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:30:38,753][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:30:39,411][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:30:40,073][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:30:40,731][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:30:41,390][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:30:42,049][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:30:42,708][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:30:43,366][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:30:44,025][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:30:44,684][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:30:45,342][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:30:46,001][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:30:46,661][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:30:47,320][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:30:47,978][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:30:48,636][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:30:49,296][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:30:49,956][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:30:50,614][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:30:51,272][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:30:51,931][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:30:52,589][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:30:53,248][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:30:53,907][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:30:54,566][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:30:55,226][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:30:55,886][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:30:56,545][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:30:57,205][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:30:57,865][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:30:58,523][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:30:59,182][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:30:59,841][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:31:00,825][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:31:01,487][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:31:02,147][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:31:02,804][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:31:03,463][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:31:04,121][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:31:04,779][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:31:05,436][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:31:06,094][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:31:06,752][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:31:07,410][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:31:08,069][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:31:08,728][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:31:09,386][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:31:10,045][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:31:10,704][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:31:11,363][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:31:12,179][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:31:13,584][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:31:13,586][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:31:13,588][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:31:15,107][__main__][INFO] - Iteration 482 took 1m 3s (24.03% Gen, 73.57% Train). Generation: 15s, Training: 46s. Estimated remaining time: 10h 10m 24s. Estimated total time: 17h 35m 33s. Time estimates for 10 more iterations: 10m 33s, 100 more iterations: 1h 45m 33s, 500 more iterations: 8h 47m 46s. [2026-03-25 21:31:15,109][__main__][INFO] - Starting iteration 482. [2026-03-25 21:31:15,113][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:31:15,114][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:31:20,019][__main__][INFO] - Number of regex retries in iteration 482: 0 [2026-03-25 21:31:20,020][__main__][INFO] - agents played in iteration 482 are Bob, Alice [2026-03-25 21:31:20,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:31:20,572][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:31:20,573][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:31:20,573][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:31:21,411][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:31:22,033][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:31:22,693][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:31:23,350][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:31:24,008][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:31:24,667][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:31:25,326][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:31:25,985][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:31:26,643][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:31:27,301][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:31:27,960][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:31:28,620][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:31:29,279][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:31:29,937][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:31:30,595][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:31:31,255][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:31:31,914][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:31:32,572][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:31:33,231][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:31:33,890][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:31:34,548][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:31:35,207][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:31:35,865][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:31:36,525][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:31:37,183][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:31:37,842][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:31:38,501][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:31:39,160][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:31:39,820][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:31:40,478][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:31:41,137][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:31:41,795][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:31:42,456][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:31:43,115][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:31:43,774][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:31:44,434][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:31:45,094][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:31:45,752][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:31:46,411][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:31:47,070][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:31:47,729][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:31:48,388][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:31:49,046][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:31:49,705][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:31:50,363][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:31:51,023][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:31:51,682][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:31:52,341][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:31:53,347][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:31:54,006][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:31:54,665][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:31:55,325][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:31:55,985][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:31:56,643][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:31:57,304][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:31:57,963][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:31:58,623][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:31:59,283][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:31:59,942][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:32:00,602][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:32:01,260][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:32:01,920][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:32:02,579][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:32:03,238][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:32:03,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:32:04,704][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:32:06,571][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:32:06,574][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:32:06,575][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:32:07,986][__main__][INFO] - Iteration 483 took 52s (9.28% Gen, 88.05% Train). Generation: 4s, Training: 46s. Estimated remaining time: 7h 15m 12s. Estimated total time: 14h 41m 15s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 7s, 500 more iterations: 7h 20m 37s. [2026-03-25 21:32:07,989][__main__][INFO] - Starting iteration 483. [2026-03-25 21:32:07,993][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:32:07,993][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:32:13,175][__main__][INFO] - Number of regex retries in iteration 483: 0 [2026-03-25 21:32:13,177][__main__][INFO] - agents played in iteration 483 are Bob, Alice [2026-03-25 21:32:13,782][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:32:13,843][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:32:13,844][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:32:13,844][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:32:14,515][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:32:15,175][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:32:15,836][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:32:16,496][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:32:17,155][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:32:17,814][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:32:18,472][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:32:19,133][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:32:19,791][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:32:20,451][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:32:21,111][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:32:21,770][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:32:22,428][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:32:23,088][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:32:23,747][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:32:24,407][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:32:25,066][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:32:25,725][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:32:26,384][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:32:27,043][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:32:27,702][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:32:28,360][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:32:29,020][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:32:29,679][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:32:30,337][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:32:30,995][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:32:31,653][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:32:32,311][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:32:32,970][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:32:33,628][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:32:34,287][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:32:34,944][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:32:35,604][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:32:36,263][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:32:36,921][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:32:37,578][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:32:38,237][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:32:38,894][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:32:39,552][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:32:40,211][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:32:40,869][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:32:41,528][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:32:42,186][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:32:42,844][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:32:43,503][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:32:44,161][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:32:44,819][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:32:45,477][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:32:46,466][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:32:47,126][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:32:47,787][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:32:48,445][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:32:49,104][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:32:49,761][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:32:50,419][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:32:51,077][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:32:51,736][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:32:52,393][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:32:53,051][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:32:53,709][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:32:54,366][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:32:55,025][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:32:55,683][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:32:56,340][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:32:56,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:32:57,793][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:32:59,152][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:32:59,154][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:33:00,608][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:33:02,139][__main__][INFO] - Iteration 484 took 54s (9.57% Gen, 87.60% Train). Generation: 5s, Training: 47s. Estimated remaining time: 7h 35m 31s. Estimated total time: 15h 2m 28s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 14s, 500 more iterations: 7h 31m 14s. [2026-03-25 21:33:02,141][__main__][INFO] - Starting iteration 484. [2026-03-25 21:33:02,153][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:33:02,154][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:33:10,545][__main__][INFO] - Number of regex retries in iteration 484: 0 [2026-03-25 21:33:10,546][__main__][INFO] - agents played in iteration 484 are Bob, Alice [2026-03-25 21:33:11,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:33:11,717][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:33:11,718][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:33:11,718][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:33:12,502][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:33:13,123][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:33:13,782][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:33:14,440][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:33:15,098][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:33:15,757][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:33:16,415][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:33:17,074][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:33:17,734][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:33:18,391][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:33:19,049][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:33:19,707][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:33:20,364][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:33:21,022][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:33:21,681][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:33:22,338][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:33:22,995][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:33:23,653][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:33:24,312][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:33:24,970][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:33:25,629][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:33:26,286][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:33:26,945][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:33:27,602][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:33:28,260][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:33:28,918][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:33:29,577][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:33:30,235][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:33:30,895][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:33:31,554][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:33:32,212][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:33:32,872][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:33:33,528][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:33:34,185][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:33:34,843][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:33:35,501][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:33:36,160][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:33:36,818][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:33:37,476][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:33:38,135][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:33:38,794][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:33:39,452][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:33:40,111][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:33:40,770][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:33:41,428][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:33:42,086][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:33:42,745][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:33:43,403][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:33:44,386][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:33:45,044][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:33:45,703][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:33:46,362][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:33:47,020][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:33:47,679][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:33:48,337][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:33:48,995][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:33:49,653][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:33:50,312][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:33:50,970][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:33:51,628][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:33:52,287][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:33:52,946][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:33:53,604][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:33:54,263][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:33:54,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:33:55,704][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:33:57,056][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:33:57,058][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:33:57,059][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:33:58,497][__main__][INFO] - Iteration 485 took 56s (14.89% Gen, 82.55% Train). Generation: 8s, Training: 46s. Estimated remaining time: 8h 11m 12s. Estimated total time: 15h 39m 5s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 54s, 500 more iterations: 7h 49m 32s. [2026-03-25 21:33:58,499][__main__][INFO] - Starting iteration 485. [2026-03-25 21:33:58,503][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:33:58,503][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:34:03,230][__main__][INFO] - Number of regex retries in iteration 485: 0 [2026-03-25 21:34:03,231][__main__][INFO] - agents played in iteration 485 are Bob, Alice [2026-03-25 21:34:03,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:34:03,835][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:34:03,836][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:34:03,836][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:34:04,591][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:34:05,201][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:34:05,861][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:34:06,519][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:34:07,178][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:34:07,838][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:34:08,496][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:34:09,156][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:34:09,815][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:34:10,474][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:34:11,135][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:34:11,792][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:34:12,451][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:34:13,112][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:34:13,771][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:34:14,434][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:34:15,092][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:34:15,752][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:34:16,411][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:34:17,071][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:34:17,730][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:34:18,389][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:34:19,048][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:34:19,708][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:34:20,367][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:34:21,027][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:34:21,686][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:34:22,345][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:34:23,005][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:34:23,666][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:34:24,324][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:34:24,982][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:34:25,641][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:34:26,306][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:34:26,965][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:34:27,624][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:34:28,284][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:34:28,944][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:34:29,603][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:34:30,262][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:34:30,922][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:34:31,581][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:34:32,240][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:34:32,899][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:34:33,559][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:34:34,217][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:34:34,877][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:34:35,535][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:34:36,544][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:34:37,203][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:34:37,861][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:34:38,520][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:34:39,178][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:34:39,836][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:34:40,494][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:34:41,154][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:34:41,813][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:34:42,471][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:34:43,129][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:34:43,788][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:34:44,446][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:34:45,104][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:34:45,762][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:34:46,420][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:34:47,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:34:47,867][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:34:49,209][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:34:49,211][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:34:49,213][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:34:50,711][__main__][INFO] - Iteration 486 took 52s (9.06% Gen, 88.07% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 1m 24s. Estimated total time: 14h 30m 9s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 0s, 500 more iterations: 7h 15m 4s. [2026-03-25 21:34:50,713][__main__][INFO] - Starting iteration 486. [2026-03-25 21:34:50,716][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:34:50,717][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:34:55,626][__main__][INFO] - Number of regex retries in iteration 486: 0 [2026-03-25 21:34:55,627][__main__][INFO] - agents played in iteration 486 are Bob, Alice [2026-03-25 21:34:56,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:34:56,284][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:34:56,285][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:34:56,285][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:34:57,146][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:34:57,756][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:34:58,416][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:34:59,075][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:34:59,733][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:35:00,392][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:35:01,051][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:35:01,709][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:35:02,367][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:35:03,025][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:35:03,683][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:35:04,341][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:35:05,000][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:35:05,658][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:35:06,315][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:35:06,972][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:35:07,631][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:35:08,288][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:35:08,946][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:35:09,604][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:35:10,263][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:35:10,920][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:35:11,580][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:35:12,238][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:35:12,895][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:35:13,552][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:35:14,211][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:35:14,868][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:35:15,525][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:35:16,183][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:35:16,841][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:35:17,499][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:35:18,157][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:35:18,814][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:35:19,472][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:35:20,130][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:35:20,787][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:35:21,444][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:35:22,102][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:35:22,760][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:35:23,418][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:35:24,076][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:35:24,734][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:35:25,391][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:35:26,049][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:35:26,707][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:35:27,364][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:35:28,022][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:35:29,003][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:35:29,661][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:35:30,321][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:35:30,979][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:35:31,638][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:35:32,296][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:35:32,954][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:35:33,612][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:35:34,271][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:35:34,929][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:35:35,587][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:35:36,245][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:35:36,904][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:35:37,562][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:35:38,220][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:35:38,877][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:35:39,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:35:40,299][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:35:41,641][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:35:41,644][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:35:41,645][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:35:43,071][__main__][INFO] - Iteration 487 took 52s (9.38% Gen, 87.89% Train). Generation: 4s, Training: 46s. Estimated remaining time: 7h 2m 58s. Estimated total time: 14h 32m 35s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 15s, 500 more iterations: 7h 16m 17s. [2026-03-25 21:35:43,073][__main__][INFO] - Starting iteration 487. [2026-03-25 21:35:43,076][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:35:43,077][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:35:47,835][__main__][INFO] - Number of regex retries in iteration 487: 0 [2026-03-25 21:35:47,836][__main__][INFO] - agents played in iteration 487 are Bob, Alice [2026-03-25 21:35:48,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:35:48,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:35:48,398][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:35:48,398][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:35:49,235][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:35:49,844][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:35:50,504][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:35:51,164][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:35:51,821][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:35:52,479][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:35:53,137][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:35:53,795][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:35:54,453][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:35:55,111][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:35:55,769][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:35:56,428][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:35:57,086][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:35:57,743][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:35:58,401][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:35:59,058][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:35:59,715][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:36:00,373][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:36:01,032][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:36:01,690][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:36:02,348][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:36:03,008][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:36:03,666][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:36:04,323][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:36:04,981][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:36:05,638][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:36:06,296][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:36:06,953][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:36:07,610][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:36:08,268][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:36:08,925][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:36:09,584][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:36:10,241][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:36:10,900][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:36:11,558][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:36:12,216][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:36:12,874][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:36:13,532][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:36:14,190][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:36:14,847][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:36:15,509][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:36:16,171][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:36:16,826][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:36:17,485][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:36:18,143][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:36:18,803][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:36:19,461][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:36:20,119][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:36:21,121][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:36:21,779][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:36:22,437][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:36:23,096][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:36:23,754][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:36:24,412][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:36:25,070][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:36:25,728][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:36:26,386][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:36:27,044][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:36:27,703][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:36:28,361][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:36:29,019][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:36:29,679][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:36:30,337][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:36:30,996][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:36:31,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:36:32,448][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:36:33,745][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:36:33,747][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:36:33,749][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:36:35,417][__main__][INFO] - Iteration 488 took 52s (9.09% Gen, 87.72% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 1m 52s. Estimated total time: 14h 32m 22s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 14s, 500 more iterations: 7h 16m 11s. [2026-03-25 21:36:35,420][__main__][INFO] - Starting iteration 488. [2026-03-25 21:36:35,425][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:36:35,425][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:36:41,251][__main__][INFO] - Number of regex retries in iteration 488: 0 [2026-03-25 21:36:41,252][__main__][INFO] - agents played in iteration 488 are Bob, Alice [2026-03-25 21:36:41,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:36:41,996][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:36:41,997][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:36:41,997][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:36:42,683][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:36:43,314][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:36:43,972][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:36:44,631][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:36:45,289][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:36:45,948][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:36:46,606][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:36:47,263][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:36:47,921][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:36:48,579][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:36:49,237][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:36:49,895][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:36:50,554][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:36:51,214][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:36:51,872][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:36:52,529][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:36:53,187][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:36:53,844][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:36:54,502][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:36:55,159][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:36:55,817][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:36:56,475][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:36:57,133][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:36:57,790][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:36:58,448][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:36:59,106][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:36:59,764][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:37:00,422][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:37:01,080][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:37:01,738][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:37:02,395][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:37:03,053][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:37:03,711][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:37:04,370][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:37:05,028][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:37:05,686][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:37:06,343][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:37:07,000][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:37:07,657][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:37:08,314][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:37:08,972][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:37:09,629][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:37:10,287][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:37:10,944][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:37:11,604][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:37:12,261][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:37:12,918][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:37:13,576][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:37:14,565][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:37:15,225][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:37:15,884][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:37:16,542][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:37:17,200][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:37:17,858][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:37:18,516][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:37:19,174][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:37:19,832][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:37:20,489][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:37:21,148][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:37:21,805][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:37:22,463][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:37:23,121][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:37:23,779][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:37:24,437][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:37:25,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:37:25,920][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:37:27,715][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:37:27,718][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:37:27,719][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:37:29,198][__main__][INFO] - Iteration 489 took 53s (10.84% Gen, 86.41% Train). Generation: 5s, Training: 46s. Estimated remaining time: 7h 24m 51s. Estimated total time: 14h 56m 15s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 37s, 500 more iterations: 7h 28m 7s. [2026-03-25 21:37:29,200][__main__][INFO] - Starting iteration 489. [2026-03-25 21:37:29,204][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:37:29,205][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:37:35,032][__main__][INFO] - Number of regex retries in iteration 489: 0 [2026-03-25 21:37:35,033][__main__][INFO] - agents played in iteration 489 are Bob, Alice [2026-03-25 21:37:35,574][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:37:35,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:37:35,636][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:37:35,637][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:37:36,459][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:37:37,068][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:37:37,727][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:37:38,386][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:37:39,043][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:37:39,702][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:37:40,362][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:37:41,019][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:37:41,678][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:37:42,334][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:37:42,993][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:37:43,650][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:37:44,308][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:37:44,966][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:37:45,623][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:37:46,282][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:37:46,940][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:37:47,597][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:37:48,256][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:37:48,913][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:37:49,570][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:37:50,228][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:37:50,885][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:37:51,542][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:37:52,201][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:37:52,859][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:37:53,516][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:37:54,174][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:37:54,831][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:37:55,489][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:37:56,147][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:37:56,809][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:37:57,501][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:37:58,146][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:37:58,804][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:37:59,462][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:38:00,120][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:38:00,778][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:38:01,436][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:38:02,094][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:38:02,751][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:38:03,409][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:38:04,067][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:38:04,725][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:38:05,383][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:38:06,042][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:38:06,700][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:38:07,358][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:38:08,352][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:38:09,012][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:38:09,670][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:38:10,329][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:38:10,987][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:38:11,647][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:38:12,306][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:38:12,966][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:38:13,624][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:38:14,282][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:38:14,939][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:38:15,600][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:38:16,259][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:38:16,917][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:38:17,575][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:38:18,233][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:38:18,892][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:38:19,714][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:38:21,060][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:38:21,062][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:38:21,064][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:38:22,716][__main__][INFO] - Iteration 490 took 53s (10.89% Gen, 86.02% Train). Generation: 5s, Training: 46s. Estimated remaining time: 7h 19m 36s. Estimated total time: 14h 51m 53s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 11s, 500 more iterations: 7h 25m 56s. [2026-03-25 21:38:22,719][__main__][INFO] - Starting iteration 490. [2026-03-25 21:38:22,725][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:38:22,726][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:38:27,910][__main__][INFO] - Number of regex retries in iteration 490: 0 [2026-03-25 21:38:27,911][__main__][INFO] - agents played in iteration 490 are Bob, Alice [2026-03-25 21:38:28,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:38:28,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:38:28,459][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:38:28,459][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:38:29,151][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:38:29,764][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:38:30,422][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:38:31,080][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:38:31,742][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:38:32,398][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:38:33,057][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:38:33,719][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:38:34,378][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:38:35,036][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:38:35,696][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:38:36,355][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:38:37,013][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:38:37,671][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:38:38,329][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:38:38,986][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:38:39,645][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:38:40,302][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:38:40,962][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:38:41,619][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:38:42,277][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:38:42,935][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:38:43,593][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:38:44,251][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:38:44,909][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:38:45,567][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:38:46,226][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:38:46,884][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:38:47,541][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:38:48,199][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:38:48,857][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:38:49,514][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:38:50,172][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:38:50,830][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:38:51,488][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:38:52,150][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:38:52,808][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:38:53,468][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:38:54,126][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:38:54,785][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:38:55,446][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:38:56,103][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:38:56,762][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:38:57,420][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:38:58,077][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:38:58,735][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:38:59,393][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:39:00,050][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:39:01,032][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:39:01,693][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:39:02,351][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:39:03,011][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:39:03,668][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:39:04,326][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:39:04,985][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:39:05,649][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:39:06,307][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:39:06,965][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:39:07,623][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:39:08,282][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:39:08,941][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:39:09,600][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:39:10,258][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:39:10,918][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:39:11,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:39:12,504][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:39:13,903][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:39:13,906][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:39:13,907][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:39:15,376][__main__][INFO] - Iteration 491 took 52s (9.85% Gen, 87.36% Train). Generation: 5s, Training: 45s. Estimated remaining time: 7h 4m 22s. Estimated total time: 14h 37m 32s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 45s, 500 more iterations: 7h 18m 46s. [2026-03-25 21:39:15,379][__main__][INFO] - Starting iteration 491. [2026-03-25 21:39:15,382][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:39:15,383][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:39:21,296][__main__][INFO] - Number of regex retries in iteration 491: 0 [2026-03-25 21:39:21,297][__main__][INFO] - agents played in iteration 491 are Bob, Alice [2026-03-25 21:39:22,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:39:22,463][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:39:22,463][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:39:22,464][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:39:23,141][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:39:23,748][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:39:24,408][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:39:25,067][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:39:25,726][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:39:26,385][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:39:27,042][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:39:27,701][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:39:28,358][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:39:29,017][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:39:29,674][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:39:30,333][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:39:30,991][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:39:31,649][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:39:32,309][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:39:32,968][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:39:33,631][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:39:34,285][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:39:34,946][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:39:35,604][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:39:36,262][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:39:36,920][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:39:37,578][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:39:38,235][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:39:38,894][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:39:39,552][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:39:40,210][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:39:40,868][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:39:41,527][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:39:42,185][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:39:42,843][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:39:43,500][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:39:44,158][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:39:44,815][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:39:45,473][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:39:46,130][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:39:46,787][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:39:47,445][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:39:48,102][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:39:48,760][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:39:49,419][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:39:50,078][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:39:50,735][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:39:51,393][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:39:52,052][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:39:52,711][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:39:53,368][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:39:54,026][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:39:55,013][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:39:55,672][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:39:56,330][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:39:56,988][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:39:57,647][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:39:58,304][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:39:58,963][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:39:59,622][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:40:00,280][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:40:00,940][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:40:01,597][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:40:02,257][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:40:02,916][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:40:03,573][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:40:04,231][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:40:04,890][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:40:05,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:40:06,260][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:40:07,565][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:40:07,567][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:40:08,335][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:40:09,842][__main__][INFO] - Iteration 492 took 54s (10.86% Gen, 86.37% Train). Generation: 5s, Training: 47s. Estimated remaining time: 7h 33m 36s. Estimated total time: 15h 7m 41s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 46s, 500 more iterations: 7h 33m 50s. [2026-03-25 21:40:09,844][__main__][INFO] - Starting iteration 492. [2026-03-25 21:40:09,848][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:40:09,849][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:40:15,930][__main__][INFO] - Number of regex retries in iteration 492: 0 [2026-03-25 21:40:15,932][__main__][INFO] - agents played in iteration 492 are Bob, Alice [2026-03-25 21:40:16,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:40:16,492][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:40:16,493][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:40:16,493][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:40:17,285][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:40:17,902][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:40:18,561][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:40:19,218][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:40:19,876][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:40:20,534][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:40:21,192][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:40:21,849][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:40:22,506][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:40:23,163][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:40:23,823][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:40:24,481][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:40:25,138][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:40:25,798][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:40:26,455][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:40:27,112][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:40:27,772][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:40:28,431][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:40:29,089][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:40:29,746][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:40:30,403][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:40:31,061][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:40:31,720][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:40:32,377][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:40:33,034][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:40:33,691][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:40:34,348][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:40:35,006][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:40:35,663][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:40:36,320][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:40:36,977][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:40:37,636][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:40:38,294][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:40:38,952][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:40:39,611][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:40:40,268][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:40:40,926][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:40:41,583][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:40:42,241][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:40:42,898][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:40:43,556][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:40:44,213][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:40:44,870][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:40:45,527][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:40:46,185][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:40:46,843][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:40:47,501][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:40:48,158][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:40:49,155][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:40:49,815][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:40:50,472][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:40:51,130][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:40:51,788][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:40:52,446][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:40:53,103][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:40:53,761][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:40:54,419][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:40:55,076][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:40:55,734][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:40:56,391][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:40:57,048][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:40:57,705][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:40:58,363][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:40:59,046][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:40:59,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:41:00,480][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43
[2026-03-25 21:41:02,271][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt
[2026-03-25 21:41:02,274][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt
[2026-03-25 21:41:02,275][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 21:41:03,798][__main__][INFO] - Iteration 493 took 53s (11.28% Gen, 85.90% Train). Generation: 6s, Training: 46s. Estimated remaining time: 7h 24m 13s. Estimated total time: 14h 59m 12s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 55s, 500 more iterations: 7h 29m 36s.
[2026-03-25 21:41:03,800][__main__][INFO] - Starting iteration 493.
[2026-03-25 21:41:03,807][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2026-03-25 21:41:03,807][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:41:08,725][__main__][INFO] - Number of regex retries in iteration 493: 0 [2026-03-25 21:41:08,726][__main__][INFO] - agents played in iteration 493 are Bob, Alice [2026-03-25 21:41:09,320][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:41:09,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:41:09,381][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:41:09,381][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:41:10,243][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:41:10,857][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:41:11,517][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:41:12,177][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:41:12,836][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:41:13,495][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:41:14,154][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:41:14,813][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:41:15,471][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:41:16,131][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:41:16,790][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:41:17,449][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:41:18,107][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:41:18,765][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:41:19,422][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:41:20,080][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:41:20,739][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:41:21,398][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:41:22,056][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:41:22,714][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:41:23,372][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:41:24,030][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:41:24,689][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:41:25,347][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:41:26,005][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:41:26,663][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:41:27,323][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:41:27,981][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:41:28,690][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:41:29,376][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:41:30,034][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:41:30,699][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:41:31,366][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:41:32,026][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:41:32,687][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:41:33,346][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:41:34,005][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:41:34,664][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:41:35,323][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:41:35,982][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:41:36,639][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:41:37,296][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:41:37,954][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:41:38,612][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:41:39,270][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:41:39,930][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:41:40,590][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:41:41,248][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:41:42,261][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:41:42,921][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:41:43,580][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:41:45,440][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:41:46,099][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:41:46,757][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:41:47,418][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:41:48,076][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:41:48,735][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:41:49,395][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:41:50,054][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:41:50,713][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:41:51,371][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:41:52,030][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:41:52,689][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:41:53,348][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:41:54,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:41:54,835][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44
[2026-03-25 21:41:56,322][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt
[2026-03-25 21:41:56,325][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt
[2026-03-25 21:41:56,327][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 21:41:57,680][__main__][INFO] - Iteration 494 took 53s (9.13% Gen, 88.35% Train). Generation: 4s, Training: 47s. Estimated remaining time: 7h 22m 2s. Estimated total time: 14h 57m 54s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 47s, 500 more iterations: 7h 28m 57s.
[2026-03-25 21:41:57,693][__main__][INFO] - Starting iteration 494.
[2026-03-25 21:41:57,722][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2026-03-25 21:41:57,723][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:42:02,542][__main__][INFO] - Number of regex retries in iteration 494: 0 [2026-03-25 21:42:02,543][__main__][INFO] - agents played in iteration 494 are Bob, Alice [2026-03-25 21:42:03,047][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:42:03,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:42:03,110][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:42:03,110][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:42:03,905][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:42:04,514][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:42:05,175][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:42:05,839][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:42:06,500][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:42:07,161][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:42:07,822][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:42:08,481][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:42:09,144][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:42:09,804][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:42:10,463][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:42:11,123][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:42:11,782][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:42:12,442][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:42:13,107][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:42:13,767][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:42:14,426][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:42:15,084][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:42:15,743][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:42:16,401][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:42:17,060][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:42:17,718][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:42:18,376][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:42:19,035][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:42:19,692][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:42:20,351][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:42:21,009][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:42:21,666][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:42:22,324][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:42:22,983][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:42:23,641][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:42:24,298][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:42:24,957][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:42:25,615][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:42:26,273][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:42:26,930][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:42:27,588][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:42:28,246][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:42:28,905][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:42:29,563][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:42:30,220][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:42:30,879][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:42:31,537][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:42:32,204][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:42:32,862][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:42:33,519][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:42:34,177][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:42:34,835][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:42:35,829][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:42:36,488][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:42:37,146][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:42:37,804][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:42:38,462][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:42:39,121][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:42:39,780][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:42:40,438][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:42:41,764][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:42:42,647][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:42:43,306][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:42:43,963][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:42:44,623][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:42:45,281][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:42:45,939][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:42:46,597][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:42:47,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:42:48,041][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44
[2026-03-25 21:42:49,394][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt
[2026-03-25 21:42:49,397][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt
[2026-03-25 21:42:49,398][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 21:42:53,200][__main__][INFO] - Iteration 495 took 55s (8.69% Gen, 84.45% Train). Generation: 4s, Training: 46s. Estimated remaining time: 7h 47m 51s. Estimated total time: 15h 24m 39s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 27s, 500 more iterations: 7h 42m 19s.
[2026-03-25 21:42:53,203][__main__][INFO] - Starting iteration 495.
[2026-03-25 21:42:53,209][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2026-03-25 21:42:53,210][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:42:59,149][__main__][INFO] - Number of regex retries in iteration 495: 0 [2026-03-25 21:42:59,150][__main__][INFO] - agents played in iteration 495 are Bob, Alice [2026-03-25 21:42:59,758][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:42:59,819][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:42:59,820][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:42:59,820][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:43:00,586][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:43:01,202][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:43:01,862][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:43:02,521][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:43:03,179][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:43:03,838][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:43:04,499][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:43:05,157][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:43:05,815][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:43:06,473][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:43:07,131][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:43:07,788][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:43:08,445][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:43:09,104][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:43:09,762][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:43:10,419][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:43:11,077][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:43:11,735][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:43:12,394][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:43:13,052][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:43:13,710][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:43:14,367][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:43:15,025][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:43:15,683][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:43:16,341][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:43:17,001][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:43:17,659][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:43:18,317][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:43:18,975][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:43:19,633][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:43:20,291][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:43:20,949][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:43:21,609][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:43:22,267][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:43:22,925][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:43:23,583][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:43:24,241][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:43:24,898][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:43:25,555][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:43:26,214][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:43:26,869][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:43:27,526][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:43:28,184][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:43:28,842][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:43:29,500][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:43:30,158][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:43:30,816][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:43:31,474][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:43:32,482][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:43:33,142][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:43:33,801][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:43:34,460][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:43:35,119][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:43:35,777][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:43:36,436][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:43:37,096][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:43:37,755][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:43:38,414][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:43:39,074][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:43:39,736][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:43:40,395][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:43:41,054][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:43:41,712][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:43:42,371][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:43:43,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:43:43,843][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43
[2026-03-25 21:43:45,221][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt
[2026-03-25 21:43:45,223][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt
[2026-03-25 21:43:45,225][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 21:43:46,655][__main__][INFO] - Iteration 496 took 53s (11.11% Gen, 86.20% Train). Generation: 5s, Training: 46s. Estimated remaining time: 7h 13m 7s. Estimated total time: 14h 50m 48s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 4s, 500 more iterations: 7h 25m 24s.
[2026-03-25 21:43:46,658][__main__][INFO] - Starting iteration 496.
[2026-03-25 21:43:46,661][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2026-03-25 21:43:46,662][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:43:51,307][__main__][INFO] - Number of regex retries in iteration 496: 0 [2026-03-25 21:43:51,308][__main__][INFO] - agents played in iteration 496 are Bob, Alice [2026-03-25 21:43:51,811][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:43:51,872][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:43:51,873][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:43:51,873][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:43:52,553][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:43:53,171][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:43:53,831][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:43:54,491][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:43:55,150][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:43:55,811][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:43:56,472][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:43:57,131][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:43:57,789][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:43:58,450][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:43:59,109][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:43:59,768][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:44:00,427][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:44:01,087][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:44:01,746][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:44:02,404][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:44:03,063][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:44:03,721][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:44:04,381][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:44:05,041][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:44:05,699][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:44:06,358][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:44:07,020][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:44:07,680][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:44:08,340][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:44:08,999][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:44:09,658][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:44:10,317][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:44:10,975][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:44:11,634][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:44:12,293][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:44:12,951][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:44:13,610][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:44:14,269][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:44:14,928][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:44:15,587][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:44:16,246][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:44:16,904][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:44:17,564][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:44:18,223][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:44:18,881][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:44:19,540][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:44:20,199][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:44:20,858][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:44:21,518][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:44:22,177][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:44:22,836][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:44:23,494][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:44:24,485][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:44:25,144][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:44:25,802][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:44:26,460][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:44:27,118][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:44:27,776][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:44:28,435][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:44:29,093][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:44:29,750][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:44:30,408][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:44:31,065][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:44:31,722][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:44:32,380][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:44:33,040][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:44:33,698][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:44:34,355][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:44:35,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:44:35,867][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:44:37,482][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:44:37,485][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:44:37,486][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:44:39,042][__main__][INFO] - Iteration 497 took 52s (8.87% Gen, 88.16% Train). Generation: 4s, Training: 46s. Estimated remaining time: 6h 54m 28s. Estimated total time: 14h 33m 2s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 18s, 500 more iterations: 7h 16m 31s. [2026-03-25 21:44:39,044][__main__][INFO] - Starting iteration 497. [2026-03-25 21:44:39,048][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:44:39,049][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:44:43,830][__main__][INFO] - Number of regex retries in iteration 497: 0 [2026-03-25 21:44:43,832][__main__][INFO] - agents played in iteration 497 are Bob, Alice [2026-03-25 21:44:44,314][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:44:44,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:44:44,376][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:44:44,377][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:44:45,226][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:44:45,861][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:44:46,519][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:44:47,177][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:44:47,836][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:44:48,494][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:44:49,151][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:44:49,808][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:44:50,465][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:44:51,124][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:44:51,782][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:44:52,439][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:44:53,097][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:44:53,755][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:44:54,412][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:44:55,069][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:44:55,728][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:44:56,386][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:44:57,044][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:44:57,704][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:44:58,363][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:44:59,021][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:44:59,679][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:45:00,337][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:45:00,996][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:45:01,654][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:45:02,312][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:45:02,971][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:45:03,629][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:45:04,287][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:45:04,944][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:45:05,603][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:45:06,261][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:45:06,918][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:45:07,576][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:45:08,234][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:45:08,892][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:45:09,550][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:45:10,209][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:45:10,867][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:45:11,525][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:45:12,183][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:45:12,840][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:45:13,498][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:45:14,157][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:45:14,814][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:45:15,471][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:45:16,129][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:45:17,121][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:45:17,781][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:45:18,439][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:45:19,097][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:45:19,755][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:45:20,413][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:45:21,072][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:45:21,729][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:45:22,387][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:45:23,045][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:45:23,702][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:45:24,360][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:45:25,019][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:45:25,676][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:45:26,334][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:45:26,992][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:45:27,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:45:28,537][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:45:29,890][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:45:29,893][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:45:29,894][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:45:31,361][__main__][INFO] - Iteration 498 took 52s (9.14% Gen, 88.05% Train). Generation: 4s, Training: 46s. Estimated remaining time: 6h 52m 28s. Estimated total time: 14h 31m 54s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 11s, 500 more iterations: 7h 15m 57s. [2026-03-25 21:45:31,363][__main__][INFO] - Starting iteration 498. [2026-03-25 21:45:31,367][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:45:31,367][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:45:41,522][__main__][INFO] - Number of regex retries in iteration 498: 0 [2026-03-25 21:45:41,523][__main__][INFO] - agents played in iteration 498 are Bob, Alice [2026-03-25 21:45:42,660][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:45:42,722][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:45:42,723][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:45:42,723][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:45:43,428][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:45:44,048][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:45:44,707][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:45:45,365][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:45:46,022][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:45:46,681][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:45:47,338][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:45:47,997][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:45:48,655][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:45:49,313][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:45:49,971][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:45:50,629][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:45:51,286][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:45:51,944][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:45:52,602][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:45:53,260][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:45:53,918][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:45:54,575][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:45:55,233][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:45:55,890][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:45:56,548][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:45:57,206][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:45:57,865][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:45:58,523][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:45:59,181][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:45:59,838][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:46:00,495][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:46:01,153][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:46:01,811][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:46:02,469][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:46:03,130][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:46:03,788][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:46:04,447][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:46:05,106][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:46:05,767][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:46:06,425][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:46:07,082][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:46:07,740][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:46:08,398][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:46:09,055][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:46:09,713][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:46:10,371][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:46:11,029][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:46:11,687][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:46:12,346][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:46:13,005][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:46:13,662][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:46:14,320][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:46:15,312][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:46:15,971][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:46:16,629][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:46:17,288][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:46:17,946][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:46:18,605][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:46:19,264][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:46:19,921][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:46:20,579][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:46:21,238][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:46:21,895][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:46:22,554][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:46:23,211][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:46:23,869][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:46:24,529][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:46:25,186][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:46:25,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:46:26,757][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:46:28,161][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:46:28,163][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:46:28,165][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:46:29,572][__main__][INFO] - Iteration 499 took 58s (17.45% Gen, 80.13% Train). Generation: 10s, Training: 46s. Estimated remaining time: 8h 29m 42s. Estimated total time: 16h 10m 6s. Time estimates for 10 more iterations: 9m 42s, 100 more iterations: 1h 37m 0s, 500 more iterations: 8h 5m 3s. [2026-03-25 21:46:29,574][__main__][INFO] - Starting iteration 499. [2026-03-25 21:46:29,578][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:46:29,578][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:46:37,914][__main__][INFO] - Number of regex retries in iteration 499: 0 [2026-03-25 21:46:37,915][__main__][INFO] - agents played in iteration 499 are Bob, Alice [2026-03-25 21:46:38,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:46:38,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:46:38,582][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:46:38,583][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:46:39,405][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:46:40,021][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:46:40,680][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:46:41,340][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:46:41,999][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:46:42,657][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:46:43,317][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:46:43,979][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:46:44,636][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:46:45,295][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:46:45,953][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:46:46,611][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:46:47,268][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:46:47,926][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:46:48,584][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:46:49,242][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:46:49,900][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:46:50,559][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:46:51,217][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:46:51,874][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:46:52,533][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:46:53,191][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:46:53,850][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:46:54,508][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:46:55,169][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:46:55,832][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:46:56,492][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:46:57,149][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:46:57,808][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:46:58,466][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:46:59,125][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:46:59,782][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:47:00,440][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:47:01,098][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:47:01,757][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:47:02,416][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:47:03,073][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:47:03,732][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:47:04,389][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:47:05,047][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:47:05,706][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:47:06,363][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:47:07,021][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:47:07,680][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:47:08,338][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:47:08,995][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:47:09,653][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:47:10,312][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:47:11,351][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:47:12,011][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:47:12,669][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:47:13,326][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:47:13,984][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:47:14,643][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:47:15,300][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:47:15,959][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:47:16,617][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:47:17,276][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:47:17,934][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:47:18,592][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:47:19,250][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:47:19,908][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:47:20,565][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:47:21,223][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:47:21,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:47:22,701][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:47:24,597][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:47:24,600][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:47:24,601][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:47:25,997][__main__][INFO] - Iteration 500 took 56s (14.78% Gen, 82.75% Train). Generation: 8s, Training: 46s. Estimated remaining time: 7h 59m 0s. Estimated total time: 15h 40m 20s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 2s, 500 more iterations: 7h 50m 10s. [2026-03-25 21:47:25,999][__main__][INFO] - Starting iteration 500. [2026-03-25 21:47:26,003][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:47:26,003][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:47:31,066][__main__][INFO] - Number of regex retries in iteration 500: 0 [2026-03-25 21:47:31,068][__main__][INFO] - agents played in iteration 500 are Bob, Alice [2026-03-25 21:47:31,562][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:47:31,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:47:31,624][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:47:31,625][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:47:32,436][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:47:33,042][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:47:33,702][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:47:34,361][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:47:35,021][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:47:35,679][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:47:36,338][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:47:36,997][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:47:37,656][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:47:38,314][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:47:38,973][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:47:39,632][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:47:40,291][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:47:40,949][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:47:41,607][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:47:42,266][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:47:42,926][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:47:43,584][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:47:44,243][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:47:44,901][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:47:45,559][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:47:46,221][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:47:46,880][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:47:47,538][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:47:48,196][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:47:48,855][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:47:49,513][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:47:50,171][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:47:50,828][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:47:51,486][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:47:52,144][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:47:52,801][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:47:53,459][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:47:54,116][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:47:54,775][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:47:55,434][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:47:56,092][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:47:56,750][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:47:57,408][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:47:58,066][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:47:58,725][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:47:59,383][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:48:00,043][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:48:00,701][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:48:01,359][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:48:02,017][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:48:02,675][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:48:03,333][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:48:04,318][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:48:04,978][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:48:05,637][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:48:06,296][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:48:06,955][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:48:07,614][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:48:08,273][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:48:08,932][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:48:09,590][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:48:10,248][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:48:10,907][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:48:11,566][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:48:12,224][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:48:12,882][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:48:13,540][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:48:14,198][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:48:14,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:48:16,335][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:48:17,654][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:48:17,656][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:48:17,657][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:48:20,642][__main__][INFO] - Iteration 501 took 54s (9.27% Gen, 85.27% Train). Generation: 5s, Training: 46s. Estimated remaining time: 7h 28m 25s. Estimated total time: 15h 10m 40s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 4s, 500 more iterations: 7h 35m 20s. [2026-03-25 21:48:20,644][__main__][INFO] - Starting iteration 501. [2026-03-25 21:48:20,648][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:48:20,648][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:48:26,530][__main__][INFO] - Number of regex retries in iteration 501: 0 [2026-03-25 21:48:26,530][__main__][INFO] - agents played in iteration 501 are Bob, Alice [2026-03-25 21:48:27,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:48:27,670][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:48:27,670][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:48:27,671][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:48:28,490][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:48:29,102][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:48:29,762][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:48:30,419][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:48:31,079][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:48:31,738][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:48:32,396][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:48:33,055][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:48:33,713][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:48:34,371][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:48:35,029][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:48:35,687][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:48:36,346][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:48:37,004][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:48:37,662][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:48:38,320][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:48:38,978][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:48:39,635][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:48:40,293][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:48:40,952][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:48:41,610][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:48:42,268][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:48:42,925][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:48:43,584][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:48:44,242][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:48:44,900][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:48:45,558][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:48:46,215][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:48:46,873][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:48:47,531][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:48:48,189][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:48:48,847][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:48:49,504][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:48:50,163][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:48:50,821][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:48:51,478][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:48:52,137][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:48:52,794][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:48:53,452][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:48:54,111][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:48:54,768][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:48:55,427][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:48:56,084][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:48:56,742][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:48:57,400][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:48:58,057][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:48:58,715][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:48:59,374][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:49:00,367][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:49:01,029][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:49:01,688][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:49:02,349][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:49:03,007][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:49:03,665][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:49:04,324][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:49:04,983][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:49:05,646][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:49:06,303][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:49:06,968][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:49:07,800][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:49:08,757][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:49:09,619][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:49:10,277][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:49:10,937][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:49:11,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:49:12,414][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:49:13,780][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:49:13,783][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:49:13,784][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:49:15,351][__main__][INFO] - Iteration 502 took 54s (10.75% Gen, 86.38% Train). Generation: 5s, Training: 47s. Estimated remaining time: 7h 28m 34s. Estimated total time: 15h 11m 44s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 10s, 500 more iterations: 7h 35m 52s. [2026-03-25 21:49:15,353][__main__][INFO] - Starting iteration 502. [2026-03-25 21:49:15,356][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:49:15,357][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:49:20,349][__main__][INFO] - Number of regex retries in iteration 502: 0 [2026-03-25 21:49:20,350][__main__][INFO] - agents played in iteration 502 are Bob, Alice [2026-03-25 21:49:20,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:49:20,908][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:49:20,908][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:49:20,909][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:49:21,704][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:49:22,315][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:49:22,975][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:49:23,633][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:49:24,292][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:49:24,950][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:49:25,608][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:49:26,266][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:49:26,925][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:49:27,585][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:49:28,243][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:49:28,903][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:49:29,560][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:49:30,218][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:49:32,399][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:49:33,057][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:49:33,715][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:49:34,373][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:49:35,032][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:49:35,690][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:49:36,348][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:49:37,006][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:49:37,664][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:49:38,323][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:49:38,980][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:49:39,639][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:49:40,297][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:49:40,955][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:49:41,613][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:49:42,271][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:49:42,930][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:49:43,588][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:49:44,247][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:49:44,904][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:49:45,563][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:49:46,222][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:49:46,881][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:49:47,540][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:49:48,197][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:49:48,856][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:49:49,513][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:49:50,171][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:49:50,830][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:49:51,491][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:49:52,149][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:49:52,808][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:49:53,466][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:49:54,126][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:49:55,110][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:49:55,770][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:49:56,428][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:49:57,093][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:49:57,750][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:49:58,409][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:49:59,066][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:49:59,724][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:50:00,384][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:50:01,042][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:50:01,702][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:50:02,360][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:50:03,019][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:50:03,678][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:50:04,338][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:50:04,997][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:50:05,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:50:06,443][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 21:50:08,259][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:50:08,261][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:50:08,263][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:50:09,734][__main__][INFO] - Iteration 503 took 54s (9.18% Gen, 88.11% Train). Generation: 4s, Training: 47s. Estimated remaining time: 7h 22m 14s. Estimated total time: 15h 6m 19s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 37s, 500 more iterations: 7h 33m 9s. [2026-03-25 21:50:09,736][__main__][INFO] - Starting iteration 503. [2026-03-25 21:50:09,740][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:50:09,741][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:50:14,682][__main__][INFO] - Number of regex retries in iteration 503: 0 [2026-03-25 21:50:14,684][__main__][INFO] - agents played in iteration 503 are Bob, Alice [2026-03-25 21:50:15,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:50:15,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:50:15,237][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:50:15,238][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:50:16,061][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:50:16,670][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:50:17,329][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:50:17,989][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:50:18,646][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:50:19,304][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:50:19,964][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:50:20,623][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:50:21,281][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:50:21,940][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:50:22,599][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:50:23,256][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:50:23,915][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:50:24,573][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:50:25,231][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:50:25,889][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:50:26,548][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:50:27,207][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:50:27,866][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:50:28,525][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:50:29,184][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:50:29,843][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:50:30,502][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:50:31,161][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:50:31,819][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:50:32,477][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:50:33,135][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:50:33,793][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:50:34,451][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:50:35,109][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:50:35,767][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:50:36,426][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:50:37,083][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:50:37,742][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:50:38,400][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:50:39,057][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:50:39,716][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:50:40,375][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:50:41,033][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:50:41,690][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:50:42,351][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:50:43,008][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:50:43,664][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:50:44,323][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:50:44,981][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:50:45,639][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:50:46,297][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:50:46,955][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:50:47,935][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:50:48,595][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:50:49,254][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:50:49,912][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:50:50,570][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:50:51,228][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:50:51,893][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:50:52,549][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:50:53,207][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:50:53,867][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:50:54,525][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:50:55,183][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:50:55,841][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:50:56,500][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:50:57,158][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:50:57,816][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:50:58,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:50:59,271][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:51:00,676][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:51:00,678][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:51:00,679][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:51:02,121][__main__][INFO] - Iteration 504 took 52s (9.44% Gen, 87.80% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 48m 6s. Estimated total time: 14h 33m 3s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 18s, 500 more iterations: 7h 16m 31s. [2026-03-25 21:51:02,123][__main__][INFO] - Starting iteration 504. [2026-03-25 21:51:02,127][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:51:02,128][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:51:06,940][__main__][INFO] - Number of regex retries in iteration 504: 0 [2026-03-25 21:51:06,941][__main__][INFO] - agents played in iteration 504 are Bob, Alice [2026-03-25 21:51:08,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:51:08,062][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:51:08,062][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:51:08,063][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:51:08,882][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:51:09,498][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:51:10,157][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:51:10,815][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:51:11,475][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:51:12,133][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:51:12,792][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:51:13,450][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:51:14,107][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:51:14,765][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:51:15,425][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:51:16,084][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:51:16,745][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:51:17,403][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:51:18,061][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:51:18,721][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:51:19,379][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:51:20,037][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:51:20,696][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:51:21,353][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:51:22,011][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:51:22,670][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:51:23,327][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:51:23,986][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:51:24,644][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:51:25,301][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:51:25,960][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:51:26,617][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:51:27,275][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:51:27,933][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:51:28,593][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:51:29,250][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:51:29,908][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:51:30,566][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:51:31,224][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:51:31,882][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:51:32,539][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:51:33,197][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:51:33,855][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:51:34,514][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:51:35,172][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:51:35,829][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:51:36,486][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:51:37,145][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:51:37,805][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:51:38,464][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:51:39,123][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:51:39,782][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:51:40,783][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:51:41,443][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:51:42,102][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:51:42,763][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:51:43,422][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:51:44,080][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:51:44,738][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:51:45,396][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:51:46,055][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:51:46,713][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:51:47,371][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:51:48,029][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:51:48,688][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:51:49,346][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:51:50,004][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:51:50,662][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:51:51,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:51:52,130][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:51:53,531][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:51:53,534][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:51:53,535][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:51:55,014][__main__][INFO] - Iteration 505 took 52s (9.04% Gen, 88.10% Train). Generation: 4s, Training: 46s. Estimated remaining time: 6h 55m 39s. Estimated total time: 14h 41m 28s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 8s, 500 more iterations: 7h 20m 44s. [2026-03-25 21:51:55,016][__main__][INFO] - Starting iteration 505. [2026-03-25 21:51:55,021][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:51:55,021][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:52:00,127][__main__][INFO] - Number of regex retries in iteration 505: 0 [2026-03-25 21:52:00,129][__main__][INFO] - agents played in iteration 505 are Bob, Alice [2026-03-25 21:52:00,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:52:00,685][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:52:00,685][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:52:00,686][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:52:01,524][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:52:02,132][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:52:02,799][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:52:03,457][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:52:04,115][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:52:04,773][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:52:05,432][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:52:06,091][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:52:06,749][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:52:07,406][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:52:08,064][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:52:08,721][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:52:09,380][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:52:10,037][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:52:10,696][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:52:11,354][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:52:12,012][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:52:12,671][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:52:13,329][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:52:13,986][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:52:14,644][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:52:15,301][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:52:15,959][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:52:16,618][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:52:17,277][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:52:17,934][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:52:18,592][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:52:19,249][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:52:19,907][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:52:20,565][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:52:21,223][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:52:21,881][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:52:22,539][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:52:23,197][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:52:23,856][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:52:24,513][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:52:25,172][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:52:25,830][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:52:26,488][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:52:27,146][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:52:27,803][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:52:28,461][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:52:29,119][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:52:29,777][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:52:30,435][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:52:31,093][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:52:31,752][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:52:32,409][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:52:33,395][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:52:34,054][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:52:34,713][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:52:35,371][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:52:36,029][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:52:36,688][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:52:37,346][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:52:38,004][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:52:38,662][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:52:39,322][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:52:39,980][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:52:40,638][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:52:41,297][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:52:41,956][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:52:42,614][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:52:43,272][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:52:43,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:52:44,711][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:52:46,044][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:52:46,047][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:52:46,048][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:52:47,566][__main__][INFO] - Iteration 506 took 52s (9.72% Gen, 87.39% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 49m 5s. Estimated total time: 14h 35m 47s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 34s, 500 more iterations: 7h 17m 53s. [2026-03-25 21:52:47,568][__main__][INFO] - Starting iteration 506. [2026-03-25 21:52:47,572][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:52:47,573][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:52:53,983][__main__][INFO] - Number of regex retries in iteration 506: 0 [2026-03-25 21:52:53,984][__main__][INFO] - agents played in iteration 506 are Bob, Alice [2026-03-25 21:52:54,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:52:54,650][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:52:54,651][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:52:54,651][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:52:55,458][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:52:56,065][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:52:56,725][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:52:57,383][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:52:58,042][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:52:58,701][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:52:59,359][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:53:00,019][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:53:00,677][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:53:01,336][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:53:01,995][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:53:02,655][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:53:03,315][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:53:03,973][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:53:04,630][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:53:05,287][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:53:05,945][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:53:06,604][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:53:07,263][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:53:07,921][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:53:08,579][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:53:09,239][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:53:09,899][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:53:10,558][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:53:11,219][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:53:11,879][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:53:12,538][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:53:13,198][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:53:13,859][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:53:14,520][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:53:15,182][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:53:15,837][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:53:16,497][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:53:17,155][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:53:17,812][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:53:18,470][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:53:19,128][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:53:19,786][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:53:20,444][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:53:21,102][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:53:21,762][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:53:22,420][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:53:23,078][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:53:23,736][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:53:24,393][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:53:25,050][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:53:25,708][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:53:26,366][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:53:27,357][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:53:28,016][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:53:28,675][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:53:29,333][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:53:29,992][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:53:30,649][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:53:31,308][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:53:31,969][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:53:32,626][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:53:33,287][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:53:33,945][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:53:34,605][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:53:35,265][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:53:35,925][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:53:36,588][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:53:37,246][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:53:37,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:53:38,586][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:53:39,764][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:53:39,766][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:53:39,768][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:53:41,125][__main__][INFO] - Iteration 507 took 53s (11.97% Gen, 85.49% Train). Generation: 6s, Training: 45s. Estimated remaining time: 7h 4m 59s. Estimated total time: 14h 52m 34s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 15s, 500 more iterations: 7h 26m 17s. [2026-03-25 21:53:41,127][__main__][INFO] - Starting iteration 507. [2026-03-25 21:53:41,131][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:53:41,131][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:53:45,869][__main__][INFO] - Number of regex retries in iteration 507: 0 [2026-03-25 21:53:45,870][__main__][INFO] - agents played in iteration 507 are Bob, Alice [2026-03-25 21:53:46,359][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:53:46,419][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:53:46,420][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:53:46,420][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:53:47,140][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:53:47,757][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:53:48,418][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:53:49,077][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:53:49,735][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:53:50,393][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:53:51,058][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:53:51,716][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:53:52,376][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:53:53,034][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:53:53,693][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:53:54,353][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:53:55,012][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:53:55,672][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:53:56,331][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:53:56,990][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:53:57,649][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:53:58,308][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:53:58,967][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:53:59,626][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:54:00,285][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:54:00,951][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:54:01,608][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:54:02,270][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:54:02,932][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:54:03,593][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:54:04,256][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:54:04,915][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:54:05,576][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:54:06,237][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:54:06,895][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:54:07,553][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:54:08,211][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:54:08,869][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:54:09,528][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:54:10,187][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:54:10,845][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:54:11,504][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:54:12,163][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:54:12,823][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:54:13,490][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:54:14,149][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:54:14,811][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:54:15,469][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:54:16,131][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:54:16,788][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:54:17,447][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:54:18,107][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:54:19,104][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:54:19,764][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:54:20,422][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:54:21,080][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:54:21,738][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:54:22,395][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:54:23,055][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:54:23,713][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:54:24,370][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:54:25,030][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:54:25,688][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:54:26,350][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:54:27,009][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:54:27,667][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:54:28,326][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:54:28,985][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:54:29,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:54:30,446][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:54:31,879][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:54:31,881][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:54:31,882][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:54:33,389][__main__][INFO] - Iteration 508 took 52s (9.07% Gen, 88.05% Train). Generation: 4s, Training: 46s. Estimated remaining time: 6h 42m 31s. Estimated total time: 14h 30m 59s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 5s, 500 more iterations: 7h 15m 29s. [2026-03-25 21:54:33,391][__main__][INFO] - Starting iteration 508. [2026-03-25 21:54:33,395][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:54:33,396][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:54:39,318][__main__][INFO] - Number of regex retries in iteration 508: 0 [2026-03-25 21:54:39,319][__main__][INFO] - agents played in iteration 508 are Bob, Alice [2026-03-25 21:54:40,383][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:54:40,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:54:40,444][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:54:40,444][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:54:41,108][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:54:41,715][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:54:42,375][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:54:43,038][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:54:43,699][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:54:44,359][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:54:45,018][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:54:45,678][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:54:46,337][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:54:46,996][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:54:47,655][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:54:48,314][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:54:49,930][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:54:51,225][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:54:51,883][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:54:52,541][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:54:53,199][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:54:53,857][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:54:54,514][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:54:55,172][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:54:55,830][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:54:56,488][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:54:57,145][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:54:57,803][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:54:58,461][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:54:59,120][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:54:59,778][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:55:00,435][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:55:01,093][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:55:01,750][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:55:02,407][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:55:03,066][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:55:03,724][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:55:04,382][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:55:05,039][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:55:05,696][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:55:06,354][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:55:07,012][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:55:07,670][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:55:08,329][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:55:08,987][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:55:09,644][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:55:10,303][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:55:10,961][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:55:11,618][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:55:12,276][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:55:12,933][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:55:13,592][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:55:14,573][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:55:15,232][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:55:15,890][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:55:16,548][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:55:17,206][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:55:17,865][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:55:18,523][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:55:19,182][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:55:19,841][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:55:21,605][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:55:22,263][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:55:22,922][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:55:23,581][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:55:24,239][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:55:24,897][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:55:25,556][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:55:26,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:55:27,002][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:45 [2026-03-25 21:55:29,519][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:55:29,521][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:55:29,523][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:55:30,984][__main__][INFO] - Iteration 509 took 57s (10.28% Gen, 87.17% Train). Generation: 5s, Training: 50s. Estimated remaining time: 8h 10m 25s. Estimated total time: 15h 59m 50s. Time estimates for 10 more iterations: 9m 35s, 100 more iterations: 1h 35m 59s, 500 more iterations: 7h 59m 55s. [2026-03-25 21:55:30,987][__main__][INFO] - Starting iteration 509. [2026-03-25 21:55:30,991][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:55:30,992][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:55:35,966][__main__][INFO] - Number of regex retries in iteration 509: 0 [2026-03-25 21:55:35,966][__main__][INFO] - agents played in iteration 509 are Bob, Alice [2026-03-25 21:55:36,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:55:36,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:55:36,545][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:55:36,546][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:55:37,389][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:55:37,995][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:55:38,654][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:55:39,315][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:55:39,973][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:55:40,632][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:55:41,292][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:55:41,949][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:55:42,608][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:55:43,266][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:55:43,923][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:55:44,581][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:55:45,239][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:55:45,897][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:55:46,556][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:55:47,214][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:55:47,872][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:55:48,531][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:55:49,190][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:55:49,847][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:55:50,505][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:55:51,163][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:55:51,822][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:55:52,479][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:55:53,137][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:55:53,795][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:55:54,453][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:55:55,111][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:55:55,768][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:55:56,427][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:55:57,085][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:55:57,744][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:55:58,401][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:55:59,060][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:55:59,718][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:56:00,376][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:56:01,034][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:56:01,692][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:56:02,351][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:56:03,009][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:56:03,667][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:56:04,326][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:56:04,984][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:56:05,642][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:56:06,300][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:56:06,960][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:56:07,618][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:56:08,276][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:56:09,262][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:56:09,922][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:56:10,579][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:56:11,239][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:56:11,896][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:56:12,553][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:56:13,211][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:56:13,871][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:56:14,530][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:56:15,187][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:56:15,844][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:56:16,502][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:56:17,160][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:56:17,818][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:56:18,478][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:56:19,135][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:56:19,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:56:20,515][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:56:21,851][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:56:21,854][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:56:21,855][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:56:23,559][__main__][INFO] - Iteration 510 took 52s (9.46% Gen, 87.29% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 45m 52s. Estimated total time: 14h 36m 10s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 37s, 500 more iterations: 7h 18m 5s. [2026-03-25 21:56:23,562][__main__][INFO] - Starting iteration 510. [2026-03-25 21:56:23,565][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:56:23,565][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:56:28,119][__main__][INFO] - Number of regex retries in iteration 510: 0 [2026-03-25 21:56:28,120][__main__][INFO] - agents played in iteration 510 are Bob, Alice [2026-03-25 21:56:28,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:56:28,788][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:56:28,788][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:56:28,789][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:56:29,572][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:56:30,188][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:56:30,848][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:56:31,507][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:56:32,166][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:56:32,825][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:56:33,483][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:56:34,144][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:56:34,804][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:56:35,464][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:56:36,121][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:56:36,781][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:56:37,440][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:56:38,096][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:56:38,756][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:56:39,414][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:56:40,074][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:56:40,732][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:56:41,391][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:56:42,050][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:56:42,708][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:56:43,366][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:56:44,026][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:56:44,685][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:56:45,344][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:56:46,001][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:56:46,660][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:56:47,318][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:56:47,979][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:56:48,636][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:56:49,294][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:56:49,952][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:56:50,610][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:56:51,268][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:56:51,927][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:56:52,585][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:56:53,243][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:56:53,902][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:56:54,561][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:56:55,220][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:56:55,877][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:56:56,535][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:56:57,193][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:56:57,851][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:56:58,510][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:56:59,170][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:56:59,827][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:57:00,486][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:57:01,485][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:57:02,144][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:57:02,802][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:57:03,460][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:57:04,118][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:57:04,776][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:57:05,434][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:57:06,093][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:57:06,753][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:57:07,412][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:57:08,071][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:57:08,729][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:57:09,387][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:57:10,046][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:57:10,704][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:57:11,362][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:57:12,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:57:12,839][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:57:13,876][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:57:13,878][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:57:13,879][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:57:15,326][__main__][INFO] - Iteration 511 took 51s (8.80% Gen, 88.40% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 31m 32s. Estimated total time: 14h 22m 42s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 16s, 500 more iterations: 7h 11m 21s. [2026-03-25 21:57:15,328][__main__][INFO] - Starting iteration 511. [2026-03-25 21:57:15,332][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:57:15,332][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:57:19,961][__main__][INFO] - Number of regex retries in iteration 511: 0 [2026-03-25 21:57:19,963][__main__][INFO] - agents played in iteration 511 are Bob, Alice [2026-03-25 21:57:20,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:57:20,870][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:57:20,870][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:57:20,871][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:57:21,680][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:57:22,296][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:57:22,957][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:57:23,617][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:57:24,276][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:57:24,936][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:57:25,594][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:57:26,253][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:57:26,910][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:57:27,569][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:57:28,227][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:57:28,884][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:57:29,544][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:57:30,203][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:57:30,861][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:57:31,518][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:57:32,177][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:57:32,834][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:57:33,492][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:57:34,150][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:57:34,809][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:57:35,468][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:57:36,126][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:57:36,783][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:57:37,441][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:57:38,099][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:57:38,757][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:57:39,415][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:57:40,073][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:57:40,732][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:57:41,391][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:57:42,049][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:57:42,707][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:57:43,364][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:57:44,022][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:57:44,680][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:57:45,337][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:57:45,995][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:57:46,653][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:57:47,310][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:57:47,969][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:57:48,627][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:57:49,286][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:57:49,944][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:57:50,602][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:57:51,260][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:57:51,918][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:57:52,577][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:57:53,570][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:57:54,228][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:57:54,887][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:57:55,545][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:57:56,204][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:57:56,863][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:57:57,520][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:57:58,178][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:57:58,836][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:57:59,496][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:58:00,154][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:58:00,814][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:58:01,474][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:58:02,132][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:58:02,790][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:58:03,448][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:58:04,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:58:04,826][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:58:06,183][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:58:06,186][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:58:06,188][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:58:07,721][__main__][INFO] - Iteration 512 took 52s (8.84% Gen, 88.23% Train). Generation: 4s, Training: 46s. Estimated remaining time: 6h 41m 8s. Estimated total time: 14h 33m 10s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 19s, 500 more iterations: 7h 16m 35s. [2026-03-25 21:58:07,724][__main__][INFO] - Starting iteration 512. [2026-03-25 21:58:07,730][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:58:07,731][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:58:12,837][__main__][INFO] - Number of regex retries in iteration 512: 0 [2026-03-25 21:58:12,837][__main__][INFO] - agents played in iteration 512 are Bob, Alice [2026-03-25 21:58:13,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:58:13,494][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:58:13,495][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:58:13,496][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:58:14,248][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:58:14,864][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:58:15,523][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:58:16,180][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:58:16,838][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:58:17,496][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:58:18,156][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:58:18,815][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:58:19,475][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:58:20,133][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:58:20,791][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:58:21,449][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:58:22,108][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:58:22,768][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:58:23,427][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:58:24,086][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:58:24,746][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:58:25,404][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:58:26,062][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:58:26,720][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:58:27,379][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:58:28,036][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:58:28,695][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:58:29,352][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:58:30,012][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:58:30,670][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:58:31,330][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:58:31,988][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:58:32,646][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:58:33,304][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:58:33,961][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:58:34,619][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:58:35,277][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:58:35,935][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:58:36,593][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:58:37,252][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:58:37,909][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:58:38,570][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:58:39,228][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:58:39,887][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:58:40,544][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:58:41,202][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:58:41,860][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:58:42,518][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:58:43,176][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:58:43,834][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:58:44,492][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:58:45,149][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:58:46,133][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:58:46,795][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:58:47,453][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:58:48,112][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:58:48,771][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:58:49,429][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:58:50,087][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:58:50,746][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:58:51,404][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:58:52,062][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:58:52,720][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:58:53,377][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:58:54,035][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:58:54,693][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:58:55,351][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:58:56,009][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:58:56,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:58:57,451][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:58:59,768][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:58:59,771][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:58:59,772][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:59:01,270][__main__][INFO] - Iteration 513 took 53s (9.54% Gen, 87.66% Train). Generation: 5s, Training: 46s. Estimated remaining time: 6h 59m 26s. Estimated total time: 14h 52m 22s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 14s, 500 more iterations: 7h 26m 11s. [2026-03-25 21:59:01,272][__main__][INFO] - Starting iteration 513. [2026-03-25 21:59:01,276][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:59:01,277][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:59:06,221][__main__][INFO] - Number of regex retries in iteration 513: 0 [2026-03-25 21:59:06,222][__main__][INFO] - agents played in iteration 513 are Bob, Alice [2026-03-25 21:59:06,719][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:59:06,780][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 21:59:06,780][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:59:06,781][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:59:07,634][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:59:08,243][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:59:08,902][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:59:09,561][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:59:10,220][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:59:10,878][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:59:11,536][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:59:12,195][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:59:12,853][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:59:13,511][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:59:14,170][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:59:14,827][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:59:15,485][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:59:16,143][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:59:16,800][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:59:17,458][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:59:18,117][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:59:18,777][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:59:19,436][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:59:20,096][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:59:20,754][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:59:21,411][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:59:22,069][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:59:22,727][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:59:23,384][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:59:24,042][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:59:24,701][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:59:25,359][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:59:26,018][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:59:26,680][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:59:27,339][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:59:27,999][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:59:28,657][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:59:29,315][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:59:29,974][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:59:30,633][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:59:31,294][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:59:31,952][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:59:32,611][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:59:33,271][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:59:33,929][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:59:34,587][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:59:35,245][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:59:35,902][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:59:36,561][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:59:37,219][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:59:37,877][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:59:38,536][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:59:39,524][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:59:40,183][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:59:40,840][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:59:41,499][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:59:42,157][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:59:42,816][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:59:43,474][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:59:44,131][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:59:44,789][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:59:45,447][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:59:46,106][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:59:46,765][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:59:47,425][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:59:48,083][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:59:48,740][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:59:49,398][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:59:50,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:59:50,860][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 21:59:52,182][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:59:52,186][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:59:52,187][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:59:56,518][__main__][INFO] - Iteration 514 took 55s (8.95% Gen, 83.20% Train). Generation: 4s, Training: 45s. Estimated remaining time: 7h 26m 53s. Estimated total time: 15h 20m 44s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 4s, 500 more iterations: 7h 40m 22s. [2026-03-25 21:59:56,521][__main__][INFO] - Starting iteration 514. [2026-03-25 21:59:56,524][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 21:59:56,524][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:00:01,386][__main__][INFO] - Number of regex retries in iteration 514: 0 [2026-03-25 22:00:01,387][__main__][INFO] - agents played in iteration 514 are Bob, Alice [2026-03-25 22:00:01,992][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:00:02,054][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:00:02,054][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:00:02,055][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:00:02,882][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:00:03,502][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:00:04,161][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:00:04,819][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:00:05,477][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:00:06,137][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:00:06,795][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:00:07,454][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:00:08,112][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:00:08,771][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:00:09,428][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:00:10,088][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:00:10,746][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:00:11,403][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:00:12,061][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:00:12,720][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:00:13,380][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:00:14,039][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:00:14,697][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:00:15,356][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:00:16,013][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:00:16,671][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:00:17,328][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:00:17,986][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:00:18,643][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:00:19,302][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:00:19,960][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:00:20,618][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:00:21,276][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:00:21,934][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:00:22,592][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:00:23,250][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:00:23,908][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:00:24,566][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:00:25,223][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:00:25,881][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:00:26,538][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:00:27,196][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:00:27,855][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:00:28,513][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:00:29,170][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:00:29,828][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:00:30,486][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:00:31,145][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:00:31,804][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:00:32,463][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:00:33,121][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:00:33,779][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:00:34,778][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:00:35,437][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:00:36,095][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:00:36,755][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:00:37,413][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:00:38,071][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:00:38,730][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:00:39,387][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:00:40,046][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:00:40,704][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:00:41,362][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:00:42,022][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:00:42,680][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:00:43,338][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:00:43,996][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:00:44,654][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:00:45,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:00:46,104][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:00:47,402][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:00:47,405][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:00:47,406][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:00:48,926][__main__][INFO] - Iteration 515 took 52s (9.28% Gen, 87.82% Train). Generation: 4s, Training: 46s. Estimated remaining time: 6h 38m 39s. Estimated total time: 14h 33m 23s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 20s, 500 more iterations: 7h 16m 41s. [2026-03-25 22:00:48,928][__main__][INFO] - Starting iteration 515. [2026-03-25 22:00:48,932][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:00:48,933][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:00:55,456][__main__][INFO] - Number of regex retries in iteration 515: 0 [2026-03-25 22:00:55,457][__main__][INFO] - agents played in iteration 515 are Bob, Alice [2026-03-25 22:00:55,955][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:00:56,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:00:56,017][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:00:56,018][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:00:56,768][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:00:57,372][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:00:58,034][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:00:58,692][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:00:59,350][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:01:00,009][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:01:00,668][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:01:01,326][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:01:01,985][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:01:02,643][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:01:03,301][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:01:03,960][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:01:04,617][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:01:05,275][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:01:05,933][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:01:06,596][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:01:07,253][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:01:07,914][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:01:08,572][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:01:09,230][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:01:09,889][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:01:10,548][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:01:11,206][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:01:11,864][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:01:12,526][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:01:13,184][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:01:13,842][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:01:14,500][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:01:15,159][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:01:15,817][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:01:16,475][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:01:17,132][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:01:17,790][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:01:18,449][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:01:19,108][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:01:19,768][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:01:20,426][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:01:21,086][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:01:21,744][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:01:22,403][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:01:23,060][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:01:23,719][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:01:24,379][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:01:25,038][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:01:25,696][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:01:26,356][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:01:27,015][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:01:27,674][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:01:28,691][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:01:29,351][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:01:30,009][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:01:30,668][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:01:31,328][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:01:31,987][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:01:32,644][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:01:33,303][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:01:33,964][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:01:34,622][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:01:35,281][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:01:35,940][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:01:36,600][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:01:37,258][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:01:37,916][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:01:38,574][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:01:39,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:01:40,041][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:01:41,419][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:01:41,422][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:01:41,423][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:01:42,812][__main__][INFO] - Iteration 516 took 53s (12.11% Gen, 85.31% Train). Generation: 6s, Training: 45s. Estimated remaining time: 7h 2m 24s. Estimated total time: 14h 58m 1s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 48s, 500 more iterations: 7h 29m 0s. [2026-03-25 22:01:42,814][__main__][INFO] - Starting iteration 516. [2026-03-25 22:01:42,818][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:01:42,818][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:01:47,623][__main__][INFO] - Number of regex retries in iteration 516: 0 [2026-03-25 22:01:47,624][__main__][INFO] - agents played in iteration 516 are Bob, Alice [2026-03-25 22:01:48,148][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:01:48,208][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:01:48,209][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:01:48,209][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:01:48,879][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:01:49,483][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:01:50,142][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:01:50,802][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:01:51,460][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:01:52,119][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:01:52,777][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:01:53,435][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:01:54,093][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:01:54,750][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:01:55,409][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:01:56,067][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:01:56,726][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:01:57,384][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:01:58,041][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:01:58,699][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:01:59,356][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:02:00,015][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:02:00,673][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:02:01,333][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:02:01,991][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:02:02,648][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:02:03,308][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:02:03,966][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:02:04,624][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:02:05,281][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:02:05,940][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:02:06,599][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:02:07,255][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:02:07,912][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:02:08,570][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:02:09,228][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:02:09,886][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:02:10,545][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:02:11,204][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:02:11,863][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:02:12,522][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:02:13,181][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:02:13,840][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:02:14,499][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:02:15,157][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:02:15,816][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:02:16,475][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:02:17,133][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:02:17,792][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:02:18,449][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:02:19,107][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:02:19,765][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:02:20,754][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:02:21,414][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:02:22,071][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:02:22,729][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:02:23,387][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:02:24,046][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:02:24,704][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:02:25,362][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:02:26,019][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:02:26,676][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:02:27,336][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:02:27,994][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:02:28,650][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:02:29,309][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:02:29,969][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:02:30,627][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:02:31,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:02:32,115][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:02:33,604][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:02:33,607][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:02:33,608][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:02:35,076][__main__][INFO] - Iteration 517 took 52s (9.19% Gen, 87.99% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 34m 30s. Estimated total time: 14h 31m 0s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 6s, 500 more iterations: 7h 15m 30s. [2026-03-25 22:02:35,079][__main__][INFO] - Starting iteration 517. [2026-03-25 22:02:35,083][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:02:35,083][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:02:39,830][__main__][INFO] - Number of regex retries in iteration 517: 0 [2026-03-25 22:02:39,831][__main__][INFO] - agents played in iteration 517 are Bob, Alice [2026-03-25 22:02:40,419][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:02:40,479][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:02:40,480][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:02:40,480][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:02:41,146][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:02:41,752][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:02:42,412][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:02:43,070][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:02:43,729][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:02:44,386][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:02:45,048][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:02:45,706][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:02:46,364][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:02:47,023][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:02:47,681][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:02:48,340][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:02:48,998][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:02:49,655][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:02:50,314][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:02:50,972][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:02:51,630][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:02:52,290][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:02:52,948][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:02:53,606][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:02:54,263][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:02:54,920][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:02:55,578][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:02:56,237][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:02:56,897][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:02:57,557][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:02:58,216][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:02:58,874][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:02:59,532][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:03:00,190][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:03:00,848][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:03:01,507][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:03:02,166][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:03:02,823][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:03:03,481][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:03:04,139][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:03:04,797][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:03:05,455][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:03:06,113][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:03:06,770][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:03:07,428][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:03:08,086][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:03:08,743][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:03:09,401][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:03:10,064][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:03:10,723][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:03:11,382][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:03:12,040][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:03:13,027][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:03:13,686][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:03:14,344][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:03:15,002][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:03:15,660][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:03:16,318][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:03:16,975][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:03:17,633][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:03:18,290][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:03:18,949][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:03:19,607][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:03:20,264][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:03:20,922][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:03:21,580][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:03:22,237][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:03:22,895][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:03:23,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:03:24,367][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:03:25,677][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:03:25,679][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:03:25,680][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:03:27,286][__main__][INFO] - Iteration 518 took 52s (9.09% Gen, 87.83% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 32m 42s. Estimated total time: 14h 30m 4s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 0s, 500 more iterations: 7h 15m 2s. [2026-03-25 22:03:27,289][__main__][INFO] - Starting iteration 518. [2026-03-25 22:03:27,292][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:03:27,293][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:03:31,917][__main__][INFO] - Number of regex retries in iteration 518: 0 [2026-03-25 22:03:31,917][__main__][INFO] - agents played in iteration 518 are Bob, Alice [2026-03-25 22:03:32,407][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:03:32,468][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:03:32,469][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:03:32,469][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:03:33,282][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:03:33,888][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:03:34,548][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:03:35,206][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:03:35,867][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:03:36,525][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:03:37,185][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:03:37,844][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:03:38,501][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:03:39,159][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:03:39,819][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:03:40,477][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:03:41,136][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:03:41,794][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:03:42,452][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:03:43,110][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:03:43,770][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:03:44,428][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:03:45,087][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:03:45,745][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:03:46,404][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:03:47,061][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:03:47,722][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:03:48,379][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:03:49,036][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:03:49,693][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:03:50,352][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:03:51,011][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:03:51,669][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:03:52,327][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:03:52,984][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:03:53,644][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:03:54,302][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:03:54,960][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:03:55,619][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:03:56,279][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:03:56,940][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:03:57,599][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:03:58,257][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:03:58,915][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:03:59,574][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:04:00,233][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:04:00,891][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:04:01,550][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:04:02,208][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:04:02,865][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:04:03,525][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:04:04,183][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:04:05,183][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:04:05,841][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:04:06,499][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:04:07,158][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:04:07,816][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:04:08,475][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:04:09,133][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:04:09,791][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:04:10,449][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:04:11,108][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:04:11,766][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:04:12,424][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:04:13,082][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:04:13,739][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:04:14,397][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:04:15,055][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:04:15,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:04:16,507][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:04:17,850][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:04:17,853][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:04:17,854][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:04:19,350][__main__][INFO] - Iteration 519 took 52s (8.88% Gen, 88.24% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 29m 26s. Estimated total time: 14h 27m 40s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 46s, 500 more iterations: 7h 13m 50s. [2026-03-25 22:04:19,353][__main__][INFO] - Starting iteration 519. [2026-03-25 22:04:19,356][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:04:19,357][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:04:25,074][__main__][INFO] - Number of regex retries in iteration 519: 0 [2026-03-25 22:04:25,075][__main__][INFO] - agents played in iteration 519 are Bob, Alice [2026-03-25 22:04:25,572][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:04:25,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:04:25,634][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:04:25,634][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:04:26,398][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:04:27,032][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:04:27,671][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:04:28,330][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:04:28,988][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:04:29,646][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:04:30,304][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:04:30,961][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:04:31,622][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:04:32,281][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:04:32,939][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:04:33,596][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:04:34,254][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:04:34,913][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:04:35,570][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:04:36,228][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:04:36,885][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:04:37,542][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:04:38,201][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:04:38,858][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:04:39,516][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:04:40,176][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:04:40,835][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:04:41,494][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:04:42,152][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:04:42,810][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:04:43,468][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:04:44,125][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:04:44,783][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:04:45,441][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:04:46,098][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:04:46,756][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:04:47,414][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:04:48,072][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:04:48,729][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:04:49,387][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:04:50,046][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:04:50,705][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:04:51,363][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:04:52,021][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:04:52,680][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:04:53,337][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:04:53,995][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:04:54,653][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:04:55,310][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:04:55,968][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:04:56,626][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:04:57,284][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:04:58,269][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:04:58,927][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:04:59,585][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:05:00,242][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:05:00,901][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:05:01,560][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:05:02,218][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:05:02,876][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:05:03,535][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:05:04,193][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:05:04,850][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:05:05,508][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:05:06,166][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:05:06,829][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:05:07,485][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:05:08,143][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:05:08,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:05:09,609][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:05:10,935][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:05:10,938][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:05:10,939][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:05:12,497][__main__][INFO] - Iteration 520 took 53s (10.76% Gen, 86.30% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 46m 35s. Estimated total time: 14h 45m 42s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 34s, 500 more iterations: 7h 22m 51s. [2026-03-25 22:05:12,499][__main__][INFO] - Starting iteration 520. [2026-03-25 22:05:12,503][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:05:12,504][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:05:18,009][__main__][INFO] - Number of regex retries in iteration 520: 0 [2026-03-25 22:05:18,010][__main__][INFO] - agents played in iteration 520 are Bob, Alice [2026-03-25 22:05:18,601][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:05:18,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:05:18,664][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:05:18,664][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:05:19,411][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:05:20,026][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:05:20,686][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:05:21,347][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:05:22,005][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:05:22,663][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:05:23,323][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:05:23,981][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:05:24,639][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:05:25,297][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:05:25,954][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:05:26,612][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:05:27,271][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:05:27,929][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:05:28,587][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:05:29,245][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:05:29,903][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:05:30,562][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:05:31,220][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:05:31,880][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:05:32,538][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:05:33,196][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:05:33,853][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:05:34,510][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:05:35,168][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:05:35,826][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:05:36,483][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:05:37,140][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:05:37,799][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:05:38,457][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:05:39,116][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:05:39,777][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:05:40,434][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:05:41,092][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:05:41,750][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:05:42,409][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:05:43,067][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:05:43,725][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:05:44,383][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:05:45,042][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:05:45,700][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:05:46,358][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:05:47,015][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:05:47,672][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:05:48,330][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:05:48,987][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:05:49,645][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:05:50,302][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:05:51,292][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:05:51,952][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:05:52,611][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:05:53,269][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:05:53,927][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:05:54,586][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:05:55,244][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:05:55,902][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:05:56,560][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:05:57,217][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:05:57,876][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:05:58,535][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:05:59,193][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:05:59,852][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:06:00,510][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:06:01,168][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:06:01,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:06:02,610][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:06:03,983][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:06:03,986][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:06:03,987][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:06:05,475][__main__][INFO] - Iteration 521 took 52s (10.39% Gen, 86.79% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 42m 53s. Estimated total time: 14h 42m 53s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 17s, 500 more iterations: 7h 21m 26s. [2026-03-25 22:06:05,477][__main__][INFO] - Starting iteration 521. [2026-03-25 22:06:05,481][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:06:05,481][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:06:13,924][__main__][INFO] - Number of regex retries in iteration 521: 0 [2026-03-25 22:06:13,932][__main__][INFO] - agents played in iteration 521 are Bob, Alice [2026-03-25 22:06:14,442][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:06:14,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:06:14,840][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:06:14,841][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:06:15,538][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:06:16,153][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:06:16,812][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:06:17,471][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:06:18,130][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:06:18,788][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:06:19,446][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:06:20,105][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:06:20,762][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:06:22,878][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:06:23,537][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:06:24,194][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:06:24,851][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:06:25,510][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:06:26,168][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:06:26,825][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:06:27,483][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:06:28,141][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:06:28,799][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:06:29,456][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:06:30,114][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:06:30,771][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:06:31,429][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:06:32,087][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:06:32,745][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:06:33,403][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:06:34,061][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:06:34,719][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:06:35,377][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:06:36,034][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:06:36,691][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:06:37,353][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:06:38,011][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:06:38,668][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:06:39,326][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:06:39,984][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:06:40,642][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:06:41,300][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:06:41,957][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:06:42,616][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:06:43,274][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:06:43,932][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:06:44,589][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:06:45,247][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:06:45,905][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:06:46,563][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:06:47,220][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:06:47,879][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:06:48,864][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:06:49,522][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:06:50,181][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:06:50,839][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:06:51,497][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:06:52,154][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:06:52,811][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:06:53,470][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:06:54,128][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:06:54,788][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:06:55,446][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:06:56,104][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:06:56,762][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:06:57,419][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:06:58,081][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:06:58,743][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:06:59,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:07:00,563][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:45 [2026-03-25 22:07:01,952][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:07:01,955][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:07:01,957][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:07:03,499][__main__][INFO] - Iteration 522 took 58s (14.57% Gen, 82.77% Train). Generation: 8s, Training: 48s. Estimated remaining time: 8h 6m 1s. Estimated total time: 16h 6m 59s. Time estimates for 10 more iterations: 9m 40s, 100 more iterations: 1h 36m 41s, 500 more iterations: 8h 3m 29s. [2026-03-25 22:07:03,501][__main__][INFO] - Starting iteration 522. [2026-03-25 22:07:03,506][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:07:03,506][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:07:12,913][__main__][INFO] - Number of regex retries in iteration 522: 0 [2026-03-25 22:07:12,914][__main__][INFO] - agents played in iteration 522 are Bob, Alice [2026-03-25 22:07:13,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:07:13,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:07:13,468][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:07:13,468][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:07:14,226][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:07:14,829][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:07:15,488][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:07:16,150][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:07:16,810][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:07:17,471][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:07:18,130][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:07:18,788][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:07:19,446][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:07:20,104][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:07:20,761][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:07:21,418][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:07:22,075][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:07:22,734][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:07:23,391][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:07:24,049][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:07:24,706][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:07:25,364][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:07:26,022][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:07:26,680][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:07:27,337][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:07:27,995][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:07:28,652][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:07:29,310][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:07:29,968][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:07:30,626][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:07:31,285][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:07:31,944][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:07:32,602][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:07:33,260][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:07:33,917][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:07:34,574][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:07:35,232][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:07:35,890][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:07:36,549][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:07:37,207][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:07:37,864][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:07:38,521][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:07:39,179][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:07:39,838][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:07:40,496][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:07:41,154][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:07:41,811][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:07:42,470][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:07:43,127][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:07:43,786][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:07:44,444][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:07:45,102][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:07:46,163][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:07:46,822][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:07:47,482][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:07:48,139][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:07:48,798][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:07:49,460][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:07:50,118][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:07:50,775][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:07:51,433][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:07:52,090][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:07:52,749][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:07:53,406][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:07:54,065][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:07:54,722][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:07:55,380][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:07:56,038][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:07:56,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:07:57,554][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:07:58,918][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:07:58,921][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:07:58,922][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:08:00,345][__main__][INFO] - Iteration 523 took 56s (16.55% Gen, 80.94% Train). Generation: 9s, Training: 46s. Estimated remaining time: 7h 45m 26s. Estimated total time: 15h 47m 21s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 44s, 500 more iterations: 7h 53m 40s. [2026-03-25 22:08:00,347][__main__][INFO] - Starting iteration 523. [2026-03-25 22:08:00,350][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:08:00,351][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:08:06,083][__main__][INFO] - Number of regex retries in iteration 523: 0 [2026-03-25 22:08:06,084][__main__][INFO] - agents played in iteration 523 are Bob, Alice [2026-03-25 22:08:06,685][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:08:06,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:08:06,747][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:08:06,747][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:08:07,396][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:08:08,013][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:08:08,672][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:08:09,332][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:08:09,992][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:08:10,651][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:08:11,310][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:08:11,969][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:08:12,629][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:08:13,288][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:08:13,951][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:08:14,609][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:08:15,270][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:08:15,929][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:08:16,587][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:08:17,245][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:08:17,904][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:08:18,563][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:08:19,222][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:08:19,880][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:08:20,539][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:08:21,200][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:08:21,858][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:08:22,517][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:08:23,175][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:08:23,834][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:08:24,493][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:08:25,151][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:08:25,810][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:08:26,469][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:08:27,128][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:08:27,787][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:08:28,450][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:08:29,110][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:08:29,772][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:08:30,431][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:08:31,090][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:08:31,748][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:08:32,408][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:08:33,067][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:08:33,726][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:08:34,384][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:08:35,043][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:08:35,702][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:08:36,362][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:08:37,020][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:08:37,679][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:08:38,338][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:08:39,327][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:08:39,986][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:08:40,644][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:08:41,302][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:08:41,960][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:08:42,618][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:08:43,275][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:08:43,934][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:08:44,591][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:08:45,249][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:08:45,906][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:08:46,564][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:08:47,222][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:08:47,879][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:08:48,536][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:08:49,194][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:08:49,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:08:50,656][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:08:52,496][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:08:52,498][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:08:52,500][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:08:53,968][__main__][INFO] - Iteration 524 took 53s (10.69% Gen, 86.56% Train). Generation: 5s, Training: 46s. Estimated remaining time: 6h 50m 51s. Estimated total time: 14h 53m 39s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 21s, 500 more iterations: 7h 26m 49s. [2026-03-25 22:08:53,970][__main__][INFO] - Starting iteration 524. [2026-03-25 22:08:53,975][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:08:53,976][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:08:59,189][__main__][INFO] - Number of regex retries in iteration 524: 0 [2026-03-25 22:08:59,191][__main__][INFO] - agents played in iteration 524 are Bob, Alice [2026-03-25 22:08:59,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:08:59,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:08:59,750][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:08:59,750][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:09:00,405][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:09:01,015][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:09:01,674][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:09:02,331][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:09:02,990][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:09:03,646][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:09:04,305][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:09:04,962][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:09:05,620][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:09:06,277][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:09:06,935][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:09:07,594][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:09:08,251][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:09:08,908][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:09:09,565][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:09:10,223][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:09:10,883][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:09:11,541][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:09:12,199][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:09:12,857][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:09:13,514][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:09:14,171][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:09:14,829][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:09:15,486][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:09:16,144][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:09:16,801][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:09:17,459][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:09:18,119][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:09:18,781][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:09:19,439][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:09:20,100][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:09:20,758][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:09:21,417][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:09:22,075][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:09:22,733][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:09:23,392][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:09:24,050][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:09:24,708][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:09:25,367][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:09:26,024][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:09:26,682][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:09:27,340][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:09:27,999][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:09:28,657][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:09:29,315][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:09:29,973][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:09:30,630][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:09:31,289][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:09:32,271][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:09:32,929][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:09:33,587][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:09:34,245][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:09:34,902][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:09:35,560][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:09:36,218][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:09:36,876][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:09:37,534][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:09:38,192][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:09:38,851][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:09:39,511][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:09:40,170][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:09:40,830][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:09:41,487][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:09:42,145][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:09:42,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:09:43,571][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:09:44,939][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:09:44,941][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:09:44,942][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:09:46,453][__main__][INFO] - Iteration 525 took 52s (9.94% Gen, 87.18% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 30m 58s. Estimated total time: 14h 34m 39s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 27s, 500 more iterations: 7h 17m 19s. [2026-03-25 22:09:46,455][__main__][INFO] - Starting iteration 525. [2026-03-25 22:09:46,460][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:09:46,461][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:09:53,970][__main__][INFO] - Number of regex retries in iteration 525: 0 [2026-03-25 22:09:53,971][__main__][INFO] - agents played in iteration 525 are Bob, Alice [2026-03-25 22:09:54,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:09:54,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:09:54,894][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:09:54,895][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:09:55,662][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:09:56,279][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:09:56,938][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:09:57,597][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:09:58,256][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:09:58,914][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:09:59,572][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:10:00,231][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:10:00,891][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:10:01,551][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:10:02,211][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:10:02,868][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:10:03,526][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:10:04,184][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:10:04,841][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:10:05,499][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:10:06,156][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:10:06,814][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:10:07,472][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:10:08,130][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:10:08,789][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:10:09,447][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:10:10,105][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:10:10,763][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:10:11,421][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:10:12,080][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:10:12,738][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:10:13,397][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:10:14,055][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:10:14,712][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:10:15,369][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:10:16,028][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:10:16,685][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:10:17,342][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:10:18,000][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:10:18,658][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:10:19,315][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:10:19,973][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:10:20,630][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:10:21,287][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:10:21,945][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:10:22,603][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:10:23,261][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:10:23,919][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:10:24,578][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:10:25,235][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:10:25,893][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:10:26,551][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:10:27,535][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:10:28,194][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:10:28,852][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:10:29,510][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:10:30,168][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:10:30,826][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:10:31,483][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:10:32,142][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:10:32,801][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:10:33,458][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:10:34,117][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:10:34,775][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:10:35,435][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:10:36,091][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:10:36,749][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:10:37,410][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:10:38,068][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:10:38,797][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:10:40,166][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:10:40,169][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:10:40,170][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:10:41,559][__main__][INFO] - Iteration 526 took 55s (13.63% Gen, 83.84% Train). Generation: 7s, Training: 46s. Estimated remaining time: 7h 13m 46s. Estimated total time: 15h 18m 22s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 50s, 500 more iterations: 7h 39m 11s. [2026-03-25 22:10:41,561][__main__][INFO] - Starting iteration 526. [2026-03-25 22:10:41,565][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:10:41,565][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:10:47,583][__main__][INFO] - Number of regex retries in iteration 526: 0 [2026-03-25 22:10:47,585][__main__][INFO] - agents played in iteration 526 are Bob, Alice [2026-03-25 22:10:48,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:10:48,154][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:10:48,154][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:10:48,155][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:10:48,930][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:10:49,539][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:10:50,201][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:10:50,859][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:10:51,518][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:10:52,176][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:10:52,834][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:10:53,493][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:10:54,151][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:10:54,808][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:10:55,467][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:10:56,124][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:10:56,783][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:10:57,440][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:10:58,097][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:10:58,756][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:10:59,414][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:11:00,072][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:11:00,729][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:11:01,388][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:11:02,046][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:11:02,703][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:11:03,360][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:11:04,018][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:11:04,676][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:11:05,334][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:11:05,992][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:11:06,650][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:11:07,307][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:11:07,964][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:11:08,622][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:11:09,280][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:11:09,938][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:11:10,596][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:11:11,253][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:11:11,911][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:11:12,568][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:11:13,226][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:11:13,883][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:11:14,542][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:11:15,199][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:11:15,858][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:11:16,516][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:11:17,174][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:11:17,832][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:11:18,491][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:11:19,149][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:11:19,807][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:11:20,803][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:11:21,464][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:11:22,120][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:11:22,779][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:11:23,436][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:11:24,096][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:11:24,753][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:11:25,413][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:11:26,070][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:11:26,728][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:11:27,387][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:11:28,044][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:11:28,704][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:11:29,363][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:11:30,021][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:11:30,680][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:11:31,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:11:32,063][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:11:33,391][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:11:33,394][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:11:33,395][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:11:34,842][__main__][INFO] - Iteration 527 took 53s (11.30% Gen, 85.98% Train). Generation: 6s, Training: 45s. Estimated remaining time: 6h 42m 29s. Estimated total time: 14h 47m 58s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 47s, 500 more iterations: 7h 23m 59s. [2026-03-25 22:11:34,845][__main__][INFO] - Starting iteration 527. [2026-03-25 22:11:34,850][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:11:34,850][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:11:39,718][__main__][INFO] - Number of regex retries in iteration 527: 0 [2026-03-25 22:11:39,719][__main__][INFO] - agents played in iteration 527 are Bob, Alice [2026-03-25 22:11:40,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:11:40,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:11:40,365][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:11:40,365][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:11:41,086][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:11:41,706][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:11:42,366][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:11:43,025][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:11:43,685][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:11:44,344][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:11:45,003][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:11:45,661][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:11:46,321][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:11:46,980][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:11:47,642][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:11:48,303][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:11:48,962][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:11:49,621][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:11:50,280][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:11:50,940][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:11:51,599][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:11:52,259][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:11:52,920][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:11:53,579][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:11:54,237][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:11:54,897][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:11:55,557][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:11:56,217][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:11:56,877][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:11:57,540][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:11:58,199][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:11:58,859][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:11:59,518][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:12:00,177][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:12:00,836][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:12:01,496][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:12:02,156][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:12:02,816][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:12:03,475][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:12:04,135][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:12:04,794][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:12:05,454][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:12:06,113][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:12:06,773][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:12:07,432][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:12:08,090][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:12:08,748][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:12:09,407][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:12:10,066][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:12:10,724][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:12:11,383][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:12:12,042][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:12:13,032][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:12:13,691][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:12:14,350][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:12:15,008][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:12:15,667][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:12:16,327][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:12:16,985][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:12:17,644][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:12:18,305][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:12:18,963][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:12:19,622][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:12:20,281][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:12:20,940][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:12:21,600][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:12:22,259][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:12:22,918][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:12:23,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:12:24,381][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:12:25,701][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:12:25,703][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:12:25,704][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:12:27,337][__main__][INFO] - Iteration 528 took 52s (9.28% Gen, 87.61% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 28m 27s. Estimated total time: 14h 34m 49s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 28s, 500 more iterations: 7h 17m 24s. [2026-03-25 22:12:27,339][__main__][INFO] - Starting iteration 528. [2026-03-25 22:12:27,342][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:12:27,343][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:12:31,974][__main__][INFO] - Number of regex retries in iteration 528: 0 [2026-03-25 22:12:31,975][__main__][INFO] - agents played in iteration 528 are Bob, Alice [2026-03-25 22:12:32,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:12:32,534][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:12:32,535][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:12:32,535][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:12:33,275][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:12:33,890][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:12:34,549][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:12:35,207][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:12:35,864][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:12:36,523][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:12:37,180][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:12:37,840][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:12:38,497][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:12:39,155][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:12:39,813][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:12:40,472][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:12:41,130][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:12:41,788][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:12:42,446][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:12:43,106][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:12:43,766][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:12:44,423][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:12:45,082][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:12:45,740][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:12:46,397][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:12:47,054][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:12:47,714][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:12:48,373][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:12:49,031][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:12:49,689][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:12:50,347][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:12:51,005][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:12:51,662][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:12:52,320][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:12:52,977][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:12:53,634][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:12:54,292][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:12:54,949][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:12:55,607][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:12:56,265][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:12:56,923][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:12:57,581][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:12:58,241][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:12:58,900][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:12:59,558][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:13:00,215][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:13:00,874][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:13:01,531][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:13:02,190][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:13:02,848][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:13:03,505][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:13:04,163][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:13:05,143][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:13:05,802][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:13:06,459][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:13:07,117][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:13:07,774][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:13:08,434][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:13:09,092][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:13:09,751][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:13:10,409][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:13:11,067][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:13:11,725][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:13:12,382][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:13:13,040][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:13:13,700][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:13:14,358][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:13:15,015][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:13:15,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:13:16,456][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:13:17,800][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:13:17,802][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:13:17,803][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:13:19,387][__main__][INFO] - Iteration 529 took 52s (8.90% Gen, 88.05% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 20m 12s. Estimated total time: 14h 27m 26s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 44s, 500 more iterations: 7h 13m 43s. [2026-03-25 22:13:19,389][__main__][INFO] - Starting iteration 529. [2026-03-25 22:13:19,393][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:13:19,393][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:13:25,382][__main__][INFO] - Number of regex retries in iteration 529: 0 [2026-03-25 22:13:25,383][__main__][INFO] - agents played in iteration 529 are Bob, Alice [2026-03-25 22:13:26,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:13:26,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:13:26,359][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:13:26,359][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:13:27,180][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:13:27,795][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:13:28,454][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:13:29,111][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:13:29,770][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:13:30,428][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:13:31,086][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:13:31,746][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:13:32,404][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:13:33,063][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:13:33,721][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:13:34,379][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:13:35,037][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:13:35,694][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:13:36,351][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:13:37,010][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:13:37,669][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:13:38,327][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:13:38,984][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:13:39,641][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:13:40,303][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:13:40,962][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:13:41,620][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:13:42,277][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:13:42,935][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:13:43,592][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:13:44,251][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:13:44,909][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:13:45,567][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:13:46,227][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:13:46,885][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:13:47,543][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:13:48,201][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:13:48,860][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:13:49,518][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:13:50,175][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:13:50,833][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:13:51,491][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:13:52,148][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:13:52,806][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:13:53,464][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:13:54,121][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:13:56,134][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:13:56,792][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:13:57,449][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:13:58,106][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:13:58,764][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:13:59,422][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:14:00,402][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:14:01,061][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:14:01,719][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:14:02,376][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:14:03,035][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:14:03,693][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:14:04,351][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:14:05,008][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:14:05,666][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:14:06,324][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:14:06,981][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:14:07,639][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:14:08,299][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:14:08,956][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:14:09,615][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:14:10,272][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:14:10,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:14:11,702][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-25 22:14:13,062][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:14:13,065][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:14:13,066][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:14:14,569][__main__][INFO] - Iteration 530 took 55s (10.86% Gen, 86.42% Train). Generation: 5s, Training: 47s. Estimated remaining time: 7h 11m 28s. Estimated total time: 15h 19m 37s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 57s, 500 more iterations: 7h 39m 48s. [2026-03-25 22:14:14,571][__main__][INFO] - Starting iteration 530. [2026-03-25 22:14:14,575][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:14:14,575][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:14:21,452][__main__][INFO] - Number of regex retries in iteration 530: 0 [2026-03-25 22:14:21,454][__main__][INFO] - agents played in iteration 530 are Bob, Alice [2026-03-25 22:14:21,954][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:14:22,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:14:22,016][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:14:22,017][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:14:22,804][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:14:23,445][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:14:24,079][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:14:24,736][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:14:25,394][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:14:26,051][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:14:26,710][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:14:27,369][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:14:28,028][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:14:28,687][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:14:29,345][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:14:30,003][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:14:30,661][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:14:31,319][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:14:31,976][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:14:32,635][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:14:33,292][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:14:33,950][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:14:34,607][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:14:35,266][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:14:35,924][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:14:36,581][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:14:37,242][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:14:37,899][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:14:38,557][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:14:39,215][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:14:39,872][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:14:40,530][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:14:41,187][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:14:41,845][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:14:42,503][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:14:43,162][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:14:43,820][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:14:44,477][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:14:45,134][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:14:45,791][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:14:46,450][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:14:47,107][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:14:47,765][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:14:48,423][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:14:49,082][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:14:49,739][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:14:50,397][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:14:51,055][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:14:51,712][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:14:52,370][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:14:53,028][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:14:53,685][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:14:54,679][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:14:55,341][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:14:55,999][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:14:56,658][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:14:57,317][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:14:57,975][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:14:58,634][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:14:59,292][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:14:59,950][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:15:00,609][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:15:01,267][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:15:01,925][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:15:02,584][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:15:03,243][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:15:03,901][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:15:04,560][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:15:05,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:15:05,994][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:15:07,389][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:15:07,392][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:15:07,393][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:15:08,806][__main__][INFO] - Iteration 531 took 54s (12.68% Gen, 84.71% Train). Generation: 6s, Training: 45s. Estimated remaining time: 6h 54m 50s. Estimated total time: 15h 3m 53s. Time estimates for 10 more iterations: 9m 2s, 100 more iterations: 1h 30m 23s, 500 more iterations: 7h 31m 56s. [2026-03-25 22:15:08,809][__main__][INFO] - Starting iteration 531. [2026-03-25 22:15:08,813][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:15:08,813][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:15:14,026][__main__][INFO] - Number of regex retries in iteration 531: 0 [2026-03-25 22:15:14,028][__main__][INFO] - agents played in iteration 531 are Bob, Alice [2026-03-25 22:15:14,601][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:15:14,662][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:15:14,662][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:15:14,663][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:15:15,485][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:15:16,099][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:15:16,758][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:15:17,416][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:15:18,073][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:15:18,732][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:15:19,390][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:15:20,049][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:15:20,707][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:15:21,364][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:15:22,023][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:15:22,681][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:15:23,338][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:15:23,997][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:15:24,655][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:15:25,312][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:15:25,969][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:15:26,626][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:15:27,284][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:15:27,942][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:15:28,607][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:15:29,265][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:15:29,923][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:15:30,581][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:15:31,240][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:15:31,898][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:15:32,558][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:15:33,216][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:15:33,876][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:15:34,536][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:15:35,194][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:15:35,852][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:15:36,511][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:15:37,171][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:15:37,829][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:15:38,488][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:15:39,146][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:15:39,805][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:15:40,464][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:15:41,122][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:15:41,782][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:15:42,441][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:15:43,100][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:15:43,759][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:15:44,423][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:15:45,083][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:15:45,740][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:15:46,398][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:15:47,396][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:15:48,056][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:15:48,714][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:15:49,372][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:15:50,030][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:15:50,689][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:15:51,348][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:15:52,006][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:15:52,666][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:15:53,326][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:15:53,985][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:15:54,642][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:15:55,301][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:15:55,959][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:15:56,618][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:15:57,277][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:15:57,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:15:58,745][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:16:00,106][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:16:00,108][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:16:00,110][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:16:01,490][__main__][INFO] - Iteration 532 took 52s (9.90% Gen, 87.48% Train). Generation: 5s, Training: 46s. Estimated remaining time: 6h 28m 3s. Estimated total time: 14h 37m 59s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 47s, 500 more iterations: 7h 18m 59s. [2026-03-25 22:16:01,492][__main__][INFO] - Starting iteration 532. [2026-03-25 22:16:01,496][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:16:01,496][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:16:06,205][__main__][INFO] - Number of regex retries in iteration 532: 0 [2026-03-25 22:16:06,206][__main__][INFO] - agents played in iteration 532 are Bob, Alice [2026-03-25 22:16:06,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:16:06,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:16:06,813][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:16:06,814][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:16:07,641][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:16:08,247][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:16:08,907][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:16:09,567][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:16:10,229][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:16:10,888][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:16:11,547][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:16:12,207][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:16:12,866][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:16:13,524][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:16:14,184][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:16:14,843][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:16:15,503][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:16:16,161][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:16:16,820][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:16:17,478][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:16:18,137][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:16:18,796][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:16:19,455][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:16:20,114][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:16:20,774][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:16:21,434][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:16:22,094][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:16:22,753][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:16:23,412][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:16:24,070][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:16:24,730][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:16:25,389][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:16:26,048][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:16:26,708][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:16:27,367][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:16:28,026][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:16:28,694][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:16:29,352][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:16:30,011][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:16:30,670][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:16:31,330][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:16:31,989][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:16:32,648][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:16:33,307][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:16:33,968][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:16:34,627][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:16:35,286][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:16:35,945][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:16:36,604][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:16:37,262][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:16:37,921][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:16:38,580][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:16:39,562][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:16:40,222][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:16:40,881][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:16:41,539][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:16:42,197][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:16:42,854][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:16:43,512][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:16:44,171][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:16:44,829][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:16:45,487][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:16:46,145][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:16:46,803][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:16:47,461][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:16:48,119][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:16:48,777][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:16:49,434][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:16:50,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:16:50,884][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:16:52,274][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:16:52,277][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:16:52,278][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:16:53,701][__main__][INFO] - Iteration 533 took 52s (9.02% Gen, 88.25% Train). Generation: 4s, Training: 46s. Estimated remaining time: 6h 19m 18s. Estimated total time: 14h 30m 7s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 0s, 500 more iterations: 7h 15m 3s. [2026-03-25 22:16:53,703][__main__][INFO] - Starting iteration 533. [2026-03-25 22:16:53,707][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:16:53,707][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:16:58,623][__main__][INFO] - Number of regex retries in iteration 533: 0 [2026-03-25 22:16:58,625][__main__][INFO] - agents played in iteration 533 are Bob, Alice [2026-03-25 22:16:59,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:16:59,189][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:16:59,190][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:16:59,190][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:16:59,900][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:17:00,513][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:17:01,173][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:17:01,832][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:17:02,490][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:17:03,148][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:17:03,808][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:17:04,467][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:17:05,125][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:17:05,783][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:17:06,441][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:17:07,098][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:17:07,756][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:17:08,414][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:17:09,072][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:17:09,731][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:17:10,389][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:17:11,047][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:17:11,705][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:17:12,362][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:17:13,022][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:17:13,680][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:17:14,339][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:17:14,997][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:17:15,655][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:17:16,313][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:17:16,970][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:17:17,629][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:17:18,287][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:17:18,944][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:17:19,602][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:17:20,259][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:17:20,917][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:17:21,574][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:17:22,233][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:17:22,891][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:17:23,549][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:17:24,207][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:17:24,864][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:17:25,522][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:17:26,181][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:17:26,839][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:17:27,498][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:17:28,162][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:17:28,819][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:17:29,478][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:17:30,136][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:17:30,794][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:17:31,785][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:17:32,444][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:17:33,103][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:17:33,761][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:17:34,419][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:17:35,077][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:17:35,735][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:17:36,393][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:17:37,051][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:17:37,710][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:17:38,367][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:17:39,025][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:17:39,683][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:17:40,341][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:17:41,001][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:17:41,659][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:17:42,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:17:43,127][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:17:44,653][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:17:44,657][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:17:44,658][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:17:46,012][__main__][INFO] - Iteration 534 took 52s (9.40% Gen, 88.01% Train). Generation: 4s, Training: 46s. Estimated remaining time: 6h 20m 5s. Estimated total time: 14h 31m 46s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 10s, 500 more iterations: 7h 15m 53s. [2026-03-25 22:17:46,014][__main__][INFO] - Starting iteration 534. [2026-03-25 22:17:46,017][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:17:46,017][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:17:51,936][__main__][INFO] - Number of regex retries in iteration 534: 0 [2026-03-25 22:17:51,937][__main__][INFO] - agents played in iteration 534 are Bob, Alice [2026-03-25 22:17:53,102][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:17:53,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:17:53,164][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:17:53,164][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:17:53,968][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:17:54,585][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:17:55,247][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:17:55,908][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:17:56,567][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:17:57,227][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:17:57,885][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:17:58,545][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:17:59,204][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:17:59,864][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:18:00,523][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:18:01,181][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:18:01,840][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:18:02,501][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:18:03,162][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:18:03,821][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:18:04,480][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:18:05,139][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:18:05,798][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:18:06,457][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:18:07,116][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:18:07,775][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:18:08,435][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:18:09,094][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:18:09,753][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:18:10,411][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:18:11,070][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:18:11,729][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:18:12,389][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:18:13,047][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:18:13,706][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:18:14,365][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:18:15,024][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:18:15,683][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:18:16,343][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:18:17,002][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:18:17,661][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:18:18,320][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:18:18,979][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:18:19,637][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:18:20,296][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:18:20,955][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:18:21,614][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:18:22,274][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:18:22,932][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:18:23,591][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:18:24,250][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:18:24,909][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:18:25,937][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:18:26,598][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:18:27,255][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:18:27,913][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:18:28,574][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:18:29,233][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:18:29,894][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:18:30,552][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:18:31,213][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:18:31,872][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:18:32,532][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:18:33,190][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:18:33,850][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:18:34,508][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:18:35,166][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:18:35,826][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:18:36,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:18:37,282][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:18:38,751][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:18:38,754][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:18:38,755][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:18:40,125][__main__][INFO] - Iteration 535 took 54s (10.94% Gen, 86.53% Train). Generation: 5s, Training: 46s. Estimated remaining time: 6h 49m 15s. Estimated total time: 15h 1m 49s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 10s, 500 more iterations: 7h 30m 54s. [2026-03-25 22:18:40,128][__main__][INFO] - Starting iteration 535. [2026-03-25 22:18:40,132][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:18:40,132][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:18:48,847][__main__][INFO] - Number of regex retries in iteration 535: 0 [2026-03-25 22:18:48,848][__main__][INFO] - agents played in iteration 535 are Bob, Alice [2026-03-25 22:18:49,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:18:49,514][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:18:49,515][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:18:49,515][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:18:50,209][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:18:50,815][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:18:51,474][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:18:52,134][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:18:52,792][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:18:53,450][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:18:54,108][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:18:54,766][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:18:55,424][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:18:56,082][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:18:56,741][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:18:57,400][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:18:58,058][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:18:58,716][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:18:59,375][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:19:00,034][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:19:00,692][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:19:01,350][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:19:02,008][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:19:02,666][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:19:03,325][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:19:03,983][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:19:04,640][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:19:05,298][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:19:05,956][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:19:06,614][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:19:07,273][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:19:07,931][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:19:08,589][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:19:09,246][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:19:09,906][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:19:10,564][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:19:11,223][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:19:11,881][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:19:12,539][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:19:13,197][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:19:13,855][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:19:14,514][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:19:15,172][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:19:15,829][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:19:16,487][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:19:17,147][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:19:17,806][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:19:18,463][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:19:19,122][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:19:19,780][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:19:20,438][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:19:21,096][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:19:22,083][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:19:22,742][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:19:23,401][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:19:24,060][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:19:24,718][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:19:25,376][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:19:26,034][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:19:26,692][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:19:27,350][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:19:28,009][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:19:28,668][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:19:29,326][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:19:29,984][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:19:30,641][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:19:31,299][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:19:31,957][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:19:32,615][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:19:33,418][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:19:34,786][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:19:34,789][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:19:34,790][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:19:36,347][__main__][INFO] - Iteration 536 took 56s (15.51% Gen, 81.72% Train). Generation: 8s, Training: 45s. Estimated remaining time: 7h 23m 26s. Estimated total time: 15h 36m 56s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 41s, 500 more iterations: 7h 48m 28s. [2026-03-25 22:19:36,349][__main__][INFO] - Starting iteration 536. [2026-03-25 22:19:36,352][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:19:36,353][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:19:42,816][__main__][INFO] - Number of regex retries in iteration 536: 0 [2026-03-25 22:19:42,818][__main__][INFO] - agents played in iteration 536 are Bob, Alice [2026-03-25 22:19:43,306][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:19:43,370][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:19:43,372][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:19:43,372][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:19:44,272][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:19:44,883][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:19:45,546][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:19:46,205][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:19:46,864][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:19:47,522][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:19:48,182][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:19:48,844][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:19:49,502][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:19:50,160][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:19:50,819][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:19:51,476][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:19:52,137][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:19:52,795][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:19:53,453][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:19:54,110][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:19:54,768][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:19:55,427][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:19:56,085][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:19:56,743][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:19:57,401][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:19:58,059][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:19:58,729][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:19:59,389][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:20:00,046][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:20:00,704][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:20:01,362][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:20:02,021][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:20:02,680][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:20:03,339][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:20:03,998][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:20:04,656][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:20:05,313][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:20:05,972][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:20:06,630][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:20:07,289][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:20:07,947][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:20:08,605][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:20:09,263][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:20:09,922][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:20:10,580][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:20:11,239][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:20:11,897][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:20:12,556][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:20:13,215][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:20:13,879][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:20:14,537][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:20:15,195][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:20:16,186][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:20:16,846][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:20:17,505][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:20:18,163][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:20:18,821][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:20:19,479][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:20:20,137][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:20:20,794][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:20:21,452][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:20:22,110][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:20:22,767][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:20:23,425][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:20:24,083][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:20:24,741][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:20:25,400][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:20:26,059][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:20:26,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:20:27,521][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:20:28,850][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:20:28,853][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:20:28,854][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:20:30,348][__main__][INFO] - Iteration 537 took 53s (11.97% Gen, 85.26% Train). Generation: 6s, Training: 46s. Estimated remaining time: 6h 45m 32s. Estimated total time: 14h 59m 57s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 59s, 500 more iterations: 7h 29m 58s. [2026-03-25 22:20:30,350][__main__][INFO] - Starting iteration 537. [2026-03-25 22:20:30,353][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:20:30,354][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:20:35,407][__main__][INFO] - Number of regex retries in iteration 537: 0 [2026-03-25 22:20:35,408][__main__][INFO] - agents played in iteration 537 are Bob, Alice [2026-03-25 22:20:36,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:20:36,080][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:20:36,081][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:20:36,081][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:20:36,820][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:20:37,431][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:20:38,091][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:20:38,750][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:20:39,408][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:20:40,068][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:20:40,727][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:20:41,387][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:20:42,045][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:20:42,705][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:20:43,363][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:20:44,020][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:20:44,678][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:20:45,336][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:20:45,996][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:20:46,655][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:20:47,312][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:20:47,969][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:20:48,629][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:20:49,287][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:20:49,945][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:20:50,602][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:20:51,260][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:20:51,918][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:20:52,575][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:20:53,234][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:20:53,892][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:20:54,553][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:20:55,211][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:20:55,871][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:20:56,529][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:20:57,187][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:20:57,845][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:20:58,509][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:20:59,166][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:20:59,823][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:21:00,481][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:21:01,139][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:21:01,796][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:21:02,454][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:21:03,111][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:21:03,769][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:21:04,428][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:21:05,086][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:21:05,744][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:21:06,402][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:21:07,061][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:21:07,719][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:21:08,708][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:21:09,368][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:21:10,026][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:21:10,684][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:21:11,342][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:21:12,000][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:21:12,658][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:21:13,317][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:21:13,975][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:21:14,633][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:21:15,291][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:21:15,949][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:21:16,607][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:21:17,265][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:21:17,924][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:21:18,583][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:21:19,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:21:20,022][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:21:21,385][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:21:21,388][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:21:21,389][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:21:22,907][__main__][INFO] - Iteration 538 took 52s (9.62% Gen, 87.49% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 20m 37s. Estimated total time: 14h 35m 55s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 35s, 500 more iterations: 7h 17m 57s. [2026-03-25 22:21:22,909][__main__][INFO] - Starting iteration 538. [2026-03-25 22:21:22,916][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:21:22,916][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:21:29,714][__main__][INFO] - Number of regex retries in iteration 538: 0 [2026-03-25 22:21:29,715][__main__][INFO] - agents played in iteration 538 are Bob, Alice [2026-03-25 22:21:30,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:21:30,678][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:21:30,679][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:21:30,679][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:21:31,514][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:21:32,123][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:21:32,782][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:21:33,440][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:21:34,098][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:21:34,756][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:21:35,414][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:21:36,072][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:21:36,731][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:21:37,389][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:21:38,047][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:21:38,706][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:21:39,364][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:21:40,022][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:21:40,679][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:21:41,336][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:21:41,995][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:21:42,653][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:21:43,310][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:21:43,969][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:21:44,627][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:21:45,285][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:21:45,943][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:21:46,600][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:21:47,258][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:21:47,916][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:21:48,573][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:21:49,231][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:21:49,889][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:21:50,548][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:21:51,207][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:21:51,865][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:21:52,526][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:21:53,184][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:21:53,843][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:21:54,501][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:21:55,159][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:21:55,817][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:21:56,475][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:21:57,133][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:21:57,792][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:21:58,450][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:21:59,107][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:21:59,768][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:22:00,423][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:22:01,081][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:22:01,739][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:22:02,397][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:22:03,389][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:22:04,049][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:22:04,708][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:22:05,366][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:22:06,024][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:22:06,683][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:22:07,342][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:22:08,000][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:22:08,659][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:22:09,318][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:22:09,976][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:22:10,635][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:22:11,293][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:22:11,951][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:22:12,609][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:22:13,268][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:22:13,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:22:14,710][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:22:19,015][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:22:19,019][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:22:19,020][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:22:20,525][__main__][INFO] - Iteration 539 took 57s (11.80% Gen, 85.58% Train). Generation: 6s, Training: 49s. Estimated remaining time: 7h 43m 55s. Estimated total time: 16h 0m 10s. Time estimates for 10 more iterations: 9m 36s, 100 more iterations: 1h 36m 1s, 500 more iterations: 8h 0m 5s. [2026-03-25 22:22:20,527][__main__][INFO] - Starting iteration 539. [2026-03-25 22:22:20,533][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:22:20,534][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:22:25,438][__main__][INFO] - Number of regex retries in iteration 539: 0 [2026-03-25 22:22:25,439][__main__][INFO] - agents played in iteration 539 are Bob, Alice [2026-03-25 22:22:26,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:22:26,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:22:26,092][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:22:26,093][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:22:26,786][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:22:27,403][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:22:28,063][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:22:28,719][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:22:29,378][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:22:30,036][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:22:30,695][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:22:31,354][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:22:32,012][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:22:32,671][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:22:33,329][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:22:33,987][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:22:34,646][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:22:35,304][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:22:35,961][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:22:36,620][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:22:37,278][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:22:37,935][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:22:38,593][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:22:39,252][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:22:39,910][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:22:40,568][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:22:41,225][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:22:41,883][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:22:42,540][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:22:43,199][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:22:43,857][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:22:44,515][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:22:45,172][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:22:45,830][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:22:46,487][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:22:47,146][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:22:47,804][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:22:48,463][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:22:49,120][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:22:49,778][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:22:50,436][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:22:51,093][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:22:51,751][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:22:52,409][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:22:53,066][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:22:53,724][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:22:54,382][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:22:55,042][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:22:55,699][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:22:56,357][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:22:57,015][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:22:57,672][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:22:58,720][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:22:59,379][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:23:00,037][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:23:00,695][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:23:01,353][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:23:02,011][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:23:02,671][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:23:03,329][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:23:03,987][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:23:04,645][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:23:05,304][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:23:05,962][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:23:06,620][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:23:07,278][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:23:07,937][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:23:08,595][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:23:09,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:23:10,058][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:23:11,409][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:23:11,412][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:23:11,413][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:23:12,855][__main__][INFO] - Iteration 540 took 52s (9.37% Gen, 87.86% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 14m 58s. Estimated total time: 14h 32m 5s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 12s, 500 more iterations: 7h 16m 2s. [2026-03-25 22:23:12,857][__main__][INFO] - Starting iteration 540. [2026-03-25 22:23:12,861][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:23:12,862][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:23:25,064][__main__][INFO] - Number of regex retries in iteration 540: 0 [2026-03-25 22:23:25,065][__main__][INFO] - agents played in iteration 540 are Bob, Alice [2026-03-25 22:23:25,602][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:23:25,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:23:25,666][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:23:25,666][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:23:26,335][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:23:26,948][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:23:27,607][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:23:28,264][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:23:28,922][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:23:29,579][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:23:30,244][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:23:30,901][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:23:31,561][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:23:32,218][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:23:32,877][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:23:33,534][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:23:34,191][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:23:34,851][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:23:35,509][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:23:36,166][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:23:36,823][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:23:37,480][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:23:38,139][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:23:38,796][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:23:39,454][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:23:40,111][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:23:40,769][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:23:41,427][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:23:42,086][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:23:42,744][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:23:43,402][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:23:44,059][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:23:44,717][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:23:45,374][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:23:46,031][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:23:46,689][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:23:47,347][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:23:48,006][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:23:48,664][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:23:49,322][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:23:49,980][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:23:50,637][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:23:51,296][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:23:51,953][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:23:52,610][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:23:53,268][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:23:53,925][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:23:54,584][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:23:55,242][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:23:55,900][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:23:56,558][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:23:57,215][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:23:58,210][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:23:58,871][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:23:59,530][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:24:00,190][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:24:00,851][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:24:01,508][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:24:02,169][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:24:02,829][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:24:03,488][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:24:04,146][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:24:04,804][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:24:05,463][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:24:06,121][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:24:06,780][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:24:07,438][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:24:08,096][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:24:08,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:24:09,533][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:24:10,881][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:24:10,884][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:24:10,885][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:24:12,337][__main__][INFO] - Iteration 541 took 59s (20.52% Gen, 77.04% Train). Generation: 12s, Training: 45s. Estimated remaining time: 8h 13m 10s. Estimated total time: 16h 31m 17s. Time estimates for 10 more iterations: 9m 54s, 100 more iterations: 1h 39m 7s, 500 more iterations: 8h 15m 38s. [2026-03-25 22:24:12,339][__main__][INFO] - Starting iteration 541. [2026-03-25 22:24:12,343][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:24:12,343][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:24:19,864][__main__][INFO] - Number of regex retries in iteration 541: 0 [2026-03-25 22:24:19,865][__main__][INFO] - agents played in iteration 541 are Bob, Alice [2026-03-25 22:24:21,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:24:21,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:24:21,078][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:24:21,079][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:24:21,754][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:24:22,364][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:24:23,023][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:24:23,680][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:24:24,338][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:24:24,995][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:24:25,654][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:24:26,311][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:24:26,970][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:24:27,627][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:24:28,285][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:24:28,944][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:24:29,602][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:24:30,259][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:24:30,917][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:24:31,575][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:24:32,232][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:24:32,891][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:24:33,549][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:24:34,206][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:24:34,863][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:24:35,523][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:24:36,180][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:24:36,838][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:24:37,495][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:24:38,153][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:24:38,812][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:24:39,470][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:24:40,128][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:24:40,787][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:24:41,445][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:24:42,102][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:24:42,760][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:24:43,417][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:24:44,076][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:24:44,733][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:24:45,390][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:24:46,048][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:24:46,705][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:24:47,363][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:24:48,021][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:24:48,679][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:24:49,336][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:24:49,994][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:24:50,652][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:24:51,310][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:24:51,967][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:24:52,625][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:24:53,619][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:24:54,277][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:24:54,937][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:24:55,595][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:24:56,254][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:24:56,912][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:24:57,570][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:24:58,229][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:24:58,888][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:24:59,546][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:25:00,204][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:25:00,862][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:25:01,521][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:25:02,179][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:25:02,837][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:25:03,496][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:25:04,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:25:04,996][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:25:06,353][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:25:06,355][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:25:06,357][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:25:07,789][__main__][INFO] - Iteration 542 took 55s (13.57% Gen, 83.85% Train). Generation: 7s, Training: 46s. Estimated remaining time: 7h 5m 6s. Estimated total time: 15h 24m 8s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 24s, 500 more iterations: 7h 42m 4s. [2026-03-25 22:25:07,791][__main__][INFO] - Starting iteration 542. [2026-03-25 22:25:07,796][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:25:07,796][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:25:13,288][__main__][INFO] - Number of regex retries in iteration 542: 0 [2026-03-25 22:25:13,289][__main__][INFO] - agents played in iteration 542 are Bob, Alice [2026-03-25 22:25:13,788][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:25:13,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:25:13,851][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:25:13,851][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:25:14,583][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:25:15,200][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:25:15,860][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:25:16,519][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:25:17,177][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:25:17,836][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:25:18,494][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:25:19,152][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:25:19,810][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:25:20,468][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:25:21,128][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:25:21,787][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:25:22,445][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:25:23,105][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:25:23,765][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:25:24,424][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:25:25,083][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:25:25,742][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:25:26,401][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:25:27,059][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:25:27,720][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:25:28,378][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:25:29,036][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:25:29,695][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:25:30,354][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:25:31,012][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:25:31,672][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:25:32,332][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:25:32,995][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:25:33,658][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:25:34,316][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:25:34,975][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:25:35,633][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:25:36,291][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:25:36,948][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:25:37,607][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:25:38,265][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:25:38,923][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:25:39,581][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:25:40,241][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:25:40,900][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:25:41,558][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:25:42,216][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:25:42,874][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:25:43,534][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:25:44,190][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:25:44,850][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:25:45,508][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:25:46,495][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:25:47,155][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:25:47,813][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:25:48,471][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:25:49,130][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:25:49,788][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:25:50,447][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:25:51,107][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:25:51,766][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:25:52,424][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:25:53,083][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:25:53,742][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:25:54,402][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:25:55,061][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:25:55,719][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:25:56,378][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:25:57,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:25:57,847][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:25:59,194][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:25:59,196][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:25:59,197][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:26:00,719][__main__][INFO] - Iteration 543 took 52s (10.38% Gen, 86.74% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 22m 10s. Estimated total time: 14h 42m 5s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 12s, 500 more iterations: 7h 21m 2s. [2026-03-25 22:26:00,721][__main__][INFO] - Starting iteration 543. [2026-03-25 22:26:00,725][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:26:00,726][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:26:05,595][__main__][INFO] - Number of regex retries in iteration 543: 0 [2026-03-25 22:26:05,596][__main__][INFO] - agents played in iteration 543 are Bob, Alice [2026-03-25 22:26:06,089][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:26:06,152][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:26:06,153][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:26:06,153][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:26:06,851][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:26:07,456][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:26:08,116][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:26:08,774][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:26:09,434][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:26:10,093][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:26:10,751][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:26:11,410][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:26:12,068][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:26:12,727][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:26:13,385][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:26:14,044][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:26:14,702][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:26:15,360][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:26:16,020][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:26:16,678][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:26:17,336][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:26:17,995][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:26:18,654][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:26:19,318][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:26:19,976][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:26:20,633][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:26:21,292][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:26:21,950][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:26:22,607][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:26:23,265][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:26:23,922][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:26:24,580][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:26:25,237][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:26:25,895][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:26:26,553][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:26:27,212][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:26:27,870][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:26:28,529][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:26:29,187][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:26:29,847][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:26:30,505][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:26:31,162][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:26:31,820][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:26:32,479][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:26:33,137][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:26:33,794][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:26:34,453][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:26:35,110][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:26:35,768][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:26:36,427][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:26:37,085][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:26:37,744][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:26:38,737][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:26:39,396][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:26:40,054][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:26:40,713][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:26:41,372][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:26:42,031][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:26:42,690][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:26:43,348][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:26:44,007][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:26:44,665][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:26:45,323][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:26:45,980][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:26:46,638][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:26:47,296][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:26:47,954][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:26:48,613][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:26:49,271][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:26:50,032][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:26:51,396][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:26:51,398][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:26:51,399][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:26:52,819][__main__][INFO] - Iteration 544 took 52s (9.35% Gen, 87.92% Train). Generation: 4s, Training: 45s. Estimated remaining time: 6h 7m 27s. Estimated total time: 14h 28m 15s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 49s, 500 more iterations: 7h 14m 7s. [2026-03-25 22:26:52,821][__main__][INFO] - Starting iteration 544. [2026-03-25 22:26:52,824][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:26:52,825][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:27:00,275][__main__][INFO] - Number of regex retries in iteration 544: 0 [2026-03-25 22:27:00,276][__main__][INFO] - agents played in iteration 544 are Bob, Alice [2026-03-25 22:27:00,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:27:00,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:27:00,944][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:27:00,945][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:27:01,849][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:27:02,468][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:27:03,127][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:27:03,787][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:27:04,446][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:27:05,106][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:27:05,763][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:27:06,422][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:27:07,081][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:27:07,739][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:27:08,397][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:27:09,055][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:27:09,713][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:27:10,372][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:27:11,030][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:27:11,688][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:27:12,347][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:27:13,005][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:27:13,666][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:27:14,324][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:27:14,981][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:27:15,640][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:27:16,300][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:27:16,958][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:27:17,616][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:27:18,275][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:27:18,933][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:27:19,592][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:27:20,250][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:27:20,909][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:27:21,569][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:27:22,227][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:27:22,885][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:27:23,546][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:27:24,205][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:27:24,862][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:27:25,520][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:27:26,179][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:27:26,836][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:27:27,494][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:27:28,154][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:27:28,812][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:27:29,470][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:27:30,129][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:27:30,789][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:27:31,447][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:27:32,112][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:27:32,769][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:27:33,752][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:27:34,410][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:27:35,069][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:27:35,727][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:27:36,384][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:27:37,042][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:27:37,700][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:27:38,361][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:27:39,019][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:27:39,678][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:27:40,337][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:27:40,996][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:27:41,655][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:27:42,314][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:27:42,972][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:27:43,631][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:27:44,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:27:45,083][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:27:47,062][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:27:47,065][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:27:47,066][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:27:48,541][__main__][INFO] - Iteration 545 took 55s (13.37% Gen, 83.98% Train). Generation: 7s, Training: 46s. Estimated remaining time: 7h 6m 55s. Estimated total time: 15h 28m 38s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 51s, 500 more iterations: 7h 44m 19s. [2026-03-25 22:27:48,543][__main__][INFO] - Starting iteration 545. [2026-03-25 22:27:48,547][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:27:48,547][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:27:53,743][__main__][INFO] - Number of regex retries in iteration 545: 0 [2026-03-25 22:27:53,744][__main__][INFO] - agents played in iteration 545 are Bob, Alice [2026-03-25 22:27:54,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:27:54,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:27:54,310][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:27:54,311][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:27:55,083][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:27:55,688][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:27:56,347][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:27:57,006][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:27:57,663][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:27:58,320][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:27:58,979][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:27:59,637][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:28:00,294][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:28:00,951][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:28:01,609][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:28:02,267][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:28:02,925][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:28:03,583][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:28:04,241][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:28:04,899][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:28:05,556][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:28:06,214][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:28:06,872][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:28:07,530][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:28:08,187][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:28:08,845][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:28:09,503][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:28:10,161][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:28:10,819][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:28:11,478][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:28:12,135][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:28:12,793][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:28:13,454][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:28:14,117][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:28:14,777][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:28:15,436][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:28:16,095][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:28:16,754][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:28:17,412][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:28:18,070][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:28:18,728][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:28:19,387][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:28:20,045][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:28:20,703][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:28:21,363][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:28:22,022][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:28:22,680][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:28:23,338][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:28:23,997][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:28:24,656][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:28:25,314][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:28:25,972][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:28:26,956][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:28:27,616][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:28:28,274][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:28:28,933][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:28:29,592][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:28:30,250][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:28:30,909][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:28:31,569][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:28:32,228][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:28:32,886][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:28:33,544][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:28:34,203][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:28:34,862][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:28:35,520][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:28:36,179][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:28:36,838][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:28:37,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:28:38,277][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:28:39,641][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:28:39,644][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:28:39,645][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:28:41,055][__main__][INFO] - Iteration 546 took 52s (9.90% Gen, 87.41% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 12m 34s. Estimated total time: 14h 35m 9s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 30s, 500 more iterations: 7h 17m 34s. [2026-03-25 22:28:41,057][__main__][INFO] - Starting iteration 546. [2026-03-25 22:28:41,061][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:28:41,061][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:28:47,265][__main__][INFO] - Number of regex retries in iteration 546: 0 [2026-03-25 22:28:47,266][__main__][INFO] - agents played in iteration 546 are Bob, Alice [2026-03-25 22:28:48,127][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:28:48,190][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:28:48,191][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:28:48,191][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:28:48,935][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:28:49,545][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:28:50,205][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:28:50,864][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:28:51,522][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:28:52,181][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:28:52,839][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:28:53,496][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:28:54,156][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:28:54,814][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:28:55,472][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:28:56,130][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:28:56,788][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:28:57,446][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:28:58,104][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:28:58,763][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:28:59,420][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:29:00,078][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:29:00,737][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:29:01,395][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:29:02,053][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:29:02,711][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:29:03,369][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:29:04,027][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:29:04,685][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:29:05,343][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:29:06,000][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:29:06,658][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:29:07,315][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:29:07,973][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:29:08,630][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:29:09,289][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:29:09,950][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:29:10,606][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:29:11,264][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:29:11,924][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:29:12,583][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:29:13,243][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:29:13,903][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:29:14,564][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:29:15,226][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:29:15,882][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:29:16,541][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:29:17,201][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:29:17,859][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:29:18,517][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:29:19,175][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:29:19,833][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:29:20,831][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:29:21,490][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:29:22,148][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:29:22,806][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:29:23,463][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:29:24,122][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:29:24,780][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:29:25,439][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:29:26,097][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:29:26,755][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:29:27,413][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:29:28,073][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:29:28,736][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:29:29,397][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:29:30,056][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:29:30,714][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:29:31,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:29:32,243][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:29:33,557][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:29:33,560][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:29:33,561][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:29:34,960][__main__][INFO] - Iteration 547 took 53s (11.51% Gen, 85.89% Train). Generation: 6s, Training: 46s. Estimated remaining time: 6h 34m 51s. Estimated total time: 14h 58m 21s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 50s, 500 more iterations: 7h 29m 10s. [2026-03-25 22:29:34,963][__main__][INFO] - Starting iteration 547. [2026-03-25 22:29:34,967][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:29:34,968][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:29:40,645][__main__][INFO] - Number of regex retries in iteration 547: 0 [2026-03-25 22:29:40,647][__main__][INFO] - agents played in iteration 547 are Bob, Alice [2026-03-25 22:29:41,250][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:29:41,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:29:41,313][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:29:41,314][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:29:42,033][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:29:42,645][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:29:43,305][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:29:43,964][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:29:44,624][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:29:45,284][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:29:45,943][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:29:46,603][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:29:47,261][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:29:47,923][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:29:48,582][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:29:49,241][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:29:49,900][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:29:50,558][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:29:51,217][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:29:51,876][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:29:52,534][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:29:53,193][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:29:53,852][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:29:54,514][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:29:55,173][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:29:55,831][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:29:56,490][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:29:57,148][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:29:57,806][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:29:58,466][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:29:59,125][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:29:59,784][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:30:00,443][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:30:01,105][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:30:01,764][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:30:02,422][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:30:03,080][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:30:03,739][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:30:04,398][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:30:05,057][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:30:05,715][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:30:06,374][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:30:07,033][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:30:07,692][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:30:08,351][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:30:09,011][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:30:09,671][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:30:10,332][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:30:10,992][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:30:11,652][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:30:12,311][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:30:12,971][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:30:13,955][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:30:14,613][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:30:15,273][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:30:15,932][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:30:16,592][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:30:17,251][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:30:17,909][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:30:18,568][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:30:19,228][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:30:19,886][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:30:20,544][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:30:21,202][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:30:21,860][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:30:22,519][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:30:23,179][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:30:23,837][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:30:24,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:30:25,314][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:30:26,674][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:30:26,677][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:30:26,678][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:30:28,099][__main__][INFO] - Iteration 548 took 53s (10.69% Gen, 86.63% Train). Generation: 5s, Training: 46s. Estimated remaining time: 6h 21m 11s. Estimated total time: 14h 45m 34s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 33s, 500 more iterations: 7h 22m 47s. [2026-03-25 22:30:28,103][__main__][INFO] - Starting iteration 548. [2026-03-25 22:30:28,106][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:30:28,107][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:30:33,156][__main__][INFO] - Number of regex retries in iteration 548: 0 [2026-03-25 22:30:33,157][__main__][INFO] - agents played in iteration 548 are Bob, Alice [2026-03-25 22:30:33,660][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:30:33,721][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:30:33,722][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:30:33,722][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:30:34,551][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:30:35,165][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:30:35,824][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:30:36,482][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:30:37,141][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:30:37,800][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:30:38,459][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:30:39,118][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:30:39,776][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:30:40,435][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:30:41,093][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:30:41,751][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:30:42,410][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:30:43,068][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:30:43,725][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:30:44,383][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:30:45,041][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:30:45,700][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:30:46,359][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:30:47,018][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:30:47,676][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:30:48,333][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:30:48,991][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:30:49,649][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:30:50,306][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:30:50,964][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:30:51,624][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:30:52,287][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:30:52,947][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:30:53,606][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:30:54,265][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:30:54,926][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:30:55,584][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:30:56,243][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:30:56,902][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:30:57,560][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:30:58,219][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:30:58,878][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:30:59,538][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:31:00,203][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:31:00,861][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:31:01,518][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:31:02,175][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:31:02,834][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:31:03,492][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:31:04,149][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:31:04,808][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:31:05,466][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:31:06,462][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:31:07,120][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:31:07,778][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:31:08,437][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:31:09,095][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:31:09,752][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:31:10,410][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:31:11,068][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:31:11,725][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:31:12,383][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:31:13,041][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:31:13,700][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:31:14,358][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:31:15,017][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:31:15,676][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:31:16,333][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:31:16,990][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:31:17,779][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:31:19,116][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:31:19,119][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:31:19,120][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:31:20,566][__main__][INFO] - Iteration 549 took 52s (9.63% Gen, 87.61% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 9m 6s. Estimated total time: 14h 34m 21s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 26s, 500 more iterations: 7h 17m 10s. [2026-03-25 22:31:20,568][__main__][INFO] - Starting iteration 549. [2026-03-25 22:31:20,572][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:31:20,573][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:31:26,764][__main__][INFO] - Number of regex retries in iteration 549: 0 [2026-03-25 22:31:26,765][__main__][INFO] - agents played in iteration 549 are Bob, Alice [2026-03-25 22:31:27,247][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:31:27,308][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:31:27,308][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:31:27,309][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:31:28,153][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:31:28,765][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:31:29,428][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:31:30,087][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:31:30,745][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:31:31,402][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:31:32,060][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:31:32,717][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:31:33,376][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:31:34,033][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:31:34,691][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:31:35,349][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:31:36,008][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:31:36,665][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:31:37,323][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:31:37,983][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:31:38,641][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:31:39,299][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:31:39,957][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:31:40,615][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:31:41,274][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:31:41,932][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:31:42,590][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:31:43,248][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:31:43,906][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:31:44,564][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:31:45,221][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:31:45,880][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:31:46,538][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:31:47,195][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:31:47,853][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:31:48,510][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:31:49,167][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:31:49,825][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:31:50,483][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:31:51,141][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:31:51,799][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:31:52,457][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:31:53,114][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:31:53,771][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:31:54,429][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:31:55,086][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:31:55,743][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:31:56,401][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:31:57,059][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:31:57,716][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:31:58,375][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:31:59,033][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:32:00,020][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:32:00,681][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:32:01,338][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:32:01,996][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:32:02,655][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:32:03,314][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:32:03,971][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:32:04,630][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:32:05,288][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:32:05,946][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:32:06,604][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:32:07,263][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:32:07,921][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:32:08,581][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:32:09,240][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:32:09,898][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:32:10,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:32:11,371][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:32:12,728][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:32:12,731][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:32:12,732][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:32:14,216][__main__][INFO] - Iteration 550 took 53s (11.54% Gen, 85.69% Train). Generation: 6s, Training: 45s. Estimated remaining time: 6h 27m 57s. Estimated total time: 14h 54m 5s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 24s, 500 more iterations: 7h 27m 2s. [2026-03-25 22:32:14,218][__main__][INFO] - Starting iteration 550. [2026-03-25 22:32:14,222][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:32:14,222][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:32:24,131][__main__][INFO] - Number of regex retries in iteration 550: 0 [2026-03-25 22:32:24,132][__main__][INFO] - agents played in iteration 550 are Bob, Alice [2026-03-25 22:32:25,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:32:25,070][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:32:25,070][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:32:25,071][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:32:25,851][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:32:26,460][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:32:27,119][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:32:27,777][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:32:28,436][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:32:29,094][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:32:29,753][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:32:30,411][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:32:31,069][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:32:31,728][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:32:32,386][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:32:33,047][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:32:33,705][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:32:34,362][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:32:35,020][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:32:35,677][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:32:36,335][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:32:36,995][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:32:37,653][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:32:38,310][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:32:38,967][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:32:39,625][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:32:40,283][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:32:40,941][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:32:41,599][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:32:42,256][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:32:42,914][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:32:43,572][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:32:44,229][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:32:44,887][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:32:45,545][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:32:46,204][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:32:46,862][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:32:47,520][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:32:48,178][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:32:48,836][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:32:49,495][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:32:50,153][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:32:50,811][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:32:51,468][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:32:52,126][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:32:52,784][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:32:53,442][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:32:54,100][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:32:54,757][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:32:55,415][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:32:56,073][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:32:56,730][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:32:57,716][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:32:58,375][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:32:59,035][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:32:59,693][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:33:00,352][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:33:01,010][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:33:01,668][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:33:02,326][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:33:02,985][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:33:03,644][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:33:04,303][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:33:04,961][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:33:05,619][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:33:06,277][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:33:06,935][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:33:07,593][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:33:08,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:33:09,030][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:33:10,394][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:33:10,396][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:33:10,398][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:33:13,441][__main__][INFO] - Iteration 551 took 59s (16.73% Gen, 78.12% Train). Generation: 9s, Training: 46s. Estimated remaining time: 7h 59m 52s. Estimated total time: 16h 27m 0s. Time estimates for 10 more iterations: 9m 52s, 100 more iterations: 1h 38m 42s, 500 more iterations: 8h 13m 30s. [2026-03-25 22:33:13,444][__main__][INFO] - Starting iteration 551. [2026-03-25 22:33:13,448][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:33:13,448][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:33:19,257][__main__][INFO] - Number of regex retries in iteration 551: 0 [2026-03-25 22:33:19,258][__main__][INFO] - agents played in iteration 551 are Bob, Alice [2026-03-25 22:33:20,014][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:33:20,075][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:33:20,075][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:33:20,076][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:33:20,915][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:33:21,534][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:33:22,193][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:33:22,856][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:33:23,514][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:33:24,172][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:33:24,830][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:33:25,489][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:33:26,146][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:33:26,803][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:33:27,462][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:33:28,119][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:33:28,777][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:33:29,435][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:33:30,092][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:33:30,749][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:33:31,407][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:33:32,065][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:33:32,731][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:33:33,384][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:33:34,043][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:33:34,703][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:33:35,360][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:33:36,018][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:33:36,675][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:33:37,332][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:33:37,990][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:33:38,648][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:33:39,306][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:33:39,965][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:33:40,623][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:33:41,281][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:33:41,938][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:33:42,598][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:33:43,256][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:33:43,913][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:33:44,571][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:33:45,228][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:33:45,887][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:33:46,546][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:33:47,204][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:33:47,864][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:33:48,521][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:33:49,179][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:33:49,838][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:33:50,496][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:33:51,155][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:33:51,812][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:33:52,831][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:33:53,490][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:33:54,149][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:33:54,807][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:33:55,467][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:33:56,125][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:33:56,783][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:33:57,442][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:33:58,100][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:33:58,760][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:33:59,417][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:34:00,076][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:34:00,734][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:34:01,393][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:34:02,051][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:34:02,708][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:34:03,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:34:04,142][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:34:05,522][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:34:05,524][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:34:05,526][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:34:07,044][__main__][INFO] - Iteration 552 took 53s (10.84% Gen, 86.32% Train). Generation: 5s, Training: 46s. Estimated remaining time: 6h 25m 16s. Estimated total time: 14h 53m 17s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 19s, 500 more iterations: 7h 26m 38s. [2026-03-25 22:34:07,046][__main__][INFO] - Starting iteration 552. [2026-03-25 22:34:07,050][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:34:07,051][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:34:12,981][__main__][INFO] - Number of regex retries in iteration 552: 0 [2026-03-25 22:34:12,982][__main__][INFO] - agents played in iteration 552 are Bob, Alice [2026-03-25 22:34:13,498][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:34:13,561][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:34:13,562][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:34:13,562][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:34:14,266][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:34:14,882][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:34:15,541][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:34:16,200][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:34:16,859][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:34:17,517][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:34:18,176][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:34:18,835][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:34:19,493][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:34:20,151][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:34:20,809][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:34:21,468][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:34:22,126][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:34:22,784][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:34:23,445][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:34:24,103][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:34:24,762][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:34:25,421][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:34:26,079][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:34:26,826][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:34:27,508][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:34:28,168][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:34:28,826][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:34:29,485][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:34:30,144][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:34:30,801][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:34:31,458][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:34:32,117][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:34:32,774][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:34:33,433][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:34:34,092][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:34:34,749][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:34:35,407][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:34:36,064][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:34:36,720][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:34:37,379][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:34:38,037][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:34:38,696][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:34:39,354][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:34:40,011][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:34:40,671][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:34:41,329][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:34:41,987][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:34:42,645][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:34:43,303][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:34:43,961][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:34:44,619][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:34:45,277][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:34:46,295][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:34:46,954][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:34:47,613][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:34:48,271][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:34:48,929][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:34:49,588][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:34:50,245][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:34:50,903][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:34:51,562][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:34:52,221][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:34:52,879][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:34:53,537][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:34:54,195][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:34:54,852][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:34:55,510][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:34:56,168][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:34:56,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:34:57,631][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:34:58,975][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:34:58,977][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:34:58,979][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:35:00,451][__main__][INFO] - Iteration 553 took 53s (11.11% Gen, 86.13% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 21m 7s. Estimated total time: 14h 50m 3s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 0s, 500 more iterations: 7h 25m 1s. [2026-03-25 22:35:00,454][__main__][INFO] - Starting iteration 553. [2026-03-25 22:35:00,469][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:35:00,469][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:35:05,540][__main__][INFO] - Number of regex retries in iteration 553: 0 [2026-03-25 22:35:05,541][__main__][INFO] - agents played in iteration 553 are Bob, Alice [2026-03-25 22:35:06,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:35:06,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:35:06,201][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:35:06,201][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:35:06,924][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:35:07,527][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:35:08,186][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:35:08,843][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:35:09,501][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:35:10,159][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:35:10,817][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:35:11,474][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:35:12,133][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:35:12,793][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:35:13,451][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:35:14,109][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:35:14,767][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:35:15,425][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:35:16,083][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:35:16,741][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:35:17,399][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:35:18,058][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:35:18,716][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:35:19,376][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:35:20,034][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:35:20,692][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:35:21,349][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:35:22,007][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:35:22,665][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:35:23,322][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:35:23,980][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:35:24,640][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:35:25,298][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:35:25,957][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:35:26,617][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:35:27,275][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:35:27,934][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:35:28,592][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:35:29,256][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:35:29,915][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:35:30,572][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:35:31,231][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:35:31,889][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:35:32,547][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:35:33,205][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:35:33,861][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:35:34,520][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:35:35,178][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:35:35,837][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:35:36,496][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:35:37,153][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:35:37,812][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:35:38,813][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:35:39,474][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:35:40,132][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:35:40,790][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:35:41,452][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:35:42,110][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:35:42,768][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:35:43,428][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:35:44,087][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:35:44,746][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:35:45,404][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:35:46,061][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:35:46,719][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:35:47,382][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:35:48,041][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:35:48,701][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:35:49,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:35:50,166][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:35:51,514][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:35:51,517][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:35:51,518][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:35:52,945][__main__][INFO] - Iteration 554 took 52s (9.66% Gen, 87.61% Train). Generation: 5s, Training: 45s. Estimated remaining time: 6h 4m 50s. Estimated total time: 14h 34m 38s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 27s, 500 more iterations: 7h 17m 19s. [2026-03-25 22:35:52,947][__main__][INFO] - Starting iteration 554. [2026-03-25 22:35:52,951][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:35:52,952][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:35:58,694][__main__][INFO] - Number of regex retries in iteration 554: 0 [2026-03-25 22:35:58,695][__main__][INFO] - agents played in iteration 554 are Bob, Alice [2026-03-25 22:35:59,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:35:59,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:35:59,699][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:35:59,700][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:36:00,506][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:36:01,126][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:36:01,786][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:36:02,446][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:36:03,104][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:36:03,764][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:36:04,423][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:36:05,089][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:36:05,749][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:36:06,407][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:36:07,065][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:36:07,724][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:36:08,383][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:36:09,041][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:36:09,699][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:36:10,357][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:36:11,015][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:36:11,672][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:36:12,331][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:36:12,989][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:36:13,647][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:36:14,304][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:36:14,963][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:36:15,621][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:36:16,279][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:36:16,937][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:36:17,594][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:36:18,252][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:36:18,910][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:36:19,567][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:36:20,225][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:36:20,883][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:36:21,541][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:36:22,199][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:36:22,857][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:36:23,516][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:36:24,175][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:36:24,832][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:36:25,491][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:36:26,149][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:36:26,807][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:36:27,465][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:36:28,124][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:36:28,784][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:36:29,442][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:36:30,100][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:36:30,759][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:36:31,416][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:36:32,421][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:36:33,080][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:36:33,741][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:36:34,400][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:36:35,059][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:36:35,717][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:36:36,377][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:36:37,034][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:36:37,695][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:36:38,353][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:36:39,015][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:36:39,674][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:36:40,333][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:36:40,992][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:36:41,651][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:36:42,309][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:36:42,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:36:43,765][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:36:45,606][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:36:45,609][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:36:45,610][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:36:47,081][__main__][INFO] - Iteration 555 took 54s (10.61% Gen, 86.67% Train). Generation: 5s, Training: 46s. Estimated remaining time: 6h 31m 30s. Estimated total time: 15h 2m 12s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 13s, 500 more iterations: 7h 31m 6s. [2026-03-25 22:36:47,083][__main__][INFO] - Starting iteration 555. [2026-03-25 22:36:47,087][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:36:47,088][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:36:55,419][__main__][INFO] - Number of regex retries in iteration 555: 0 [2026-03-25 22:36:55,421][__main__][INFO] - agents played in iteration 555 are Bob, Alice [2026-03-25 22:36:56,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:36:56,085][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:36:56,085][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:36:56,086][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:36:56,754][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:36:57,363][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:36:58,024][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:36:58,685][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:36:59,345][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:37:00,005][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:37:00,666][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:37:01,331][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:37:01,995][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:37:02,653][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:37:03,312][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:37:03,970][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:37:04,629][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:37:05,288][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:37:05,946][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:37:06,604][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:37:07,264][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:37:07,922][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:37:08,581][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:37:09,240][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:37:09,900][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:37:10,559][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:37:11,217][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:37:11,875][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:37:12,534][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:37:13,193][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:37:13,851][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:37:14,510][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:37:15,168][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:37:15,827][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:37:16,486][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:37:17,145][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:37:17,804][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:37:18,462][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:37:19,120][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:37:19,778][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:37:20,437][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:37:21,095][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:37:21,754][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:37:22,417][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:37:23,077][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:37:23,740][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:37:24,396][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:37:25,057][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:37:25,718][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:37:26,377][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:37:27,037][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:37:27,696][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:37:28,686][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:37:29,347][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:37:30,006][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:37:30,665][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:37:31,323][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:37:31,983][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:37:32,643][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:37:33,302][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:37:33,961][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:37:34,619][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:37:35,277][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:37:35,936][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:37:36,594][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:37:37,252][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:37:37,910][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:37:38,568][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:37:39,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:37:40,005][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:37:41,592][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:37:41,594][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:37:41,596][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:37:43,077][__main__][INFO] - Iteration 556 took 55s (14.88% Gen, 82.47% Train). Generation: 8s, Training: 46s. Estimated remaining time: 7h 1m 33s. Estimated total time: 15h 33m 11s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 19s, 500 more iterations: 7h 46m 35s. [2026-03-25 22:37:43,080][__main__][INFO] - Starting iteration 556. [2026-03-25 22:37:43,084][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:37:43,085][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:37:48,019][__main__][INFO] - Number of regex retries in iteration 556: 0 [2026-03-25 22:37:48,021][__main__][INFO] - agents played in iteration 556 are Bob, Alice [2026-03-25 22:37:48,513][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:37:48,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:37:48,576][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:37:48,576][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:37:49,345][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:37:49,963][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:37:50,623][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:37:51,281][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:37:51,941][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:37:52,602][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:37:53,259][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:37:53,917][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:37:54,575][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:37:55,234][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:37:55,892][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:37:56,549][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:37:57,209][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:37:57,867][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:37:58,526][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:37:59,186][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:37:59,845][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:38:00,503][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:38:01,162][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:38:01,820][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:38:02,479][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:38:03,138][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:38:03,796][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:38:04,454][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:38:05,111][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:38:05,769][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:38:06,427][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:38:07,085][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:38:07,742][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:38:08,401][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:38:09,059][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:38:09,720][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:38:10,380][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:38:11,039][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:38:11,698][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:38:12,356][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:38:13,014][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:38:13,673][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:38:14,331][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:38:14,989][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:38:15,649][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:38:16,310][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:38:16,969][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:38:17,628][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:38:18,286][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:38:18,943][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:38:19,606][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:38:20,264][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:38:21,281][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:38:21,939][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:38:22,598][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:38:23,256][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:38:23,915][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:38:24,573][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:38:25,232][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:38:25,890][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:38:26,549][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:38:27,208][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:38:27,867][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:38:28,526][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:38:29,189][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:38:29,848][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:38:30,506][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:38:31,165][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:38:31,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:38:32,672][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:38:34,525][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:38:34,528][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:38:34,529][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:38:36,005][__main__][INFO] - Iteration 557 took 52s (9.33% Gen, 87.88% Train). Generation: 4s, Training: 46s. Estimated remaining time: 6h 9m 32s. Estimated total time: 14h 42m 2s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 12s, 500 more iterations: 7h 21m 1s. [2026-03-25 22:38:36,007][__main__][INFO] - Starting iteration 557. [2026-03-25 22:38:36,011][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:38:36,011][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:38:41,066][__main__][INFO] - Number of regex retries in iteration 557: 0 [2026-03-25 22:38:41,068][__main__][INFO] - agents played in iteration 557 are Bob, Alice [2026-03-25 22:38:41,690][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:38:41,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:38:41,752][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:38:41,753][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:38:42,512][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:38:43,128][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:38:43,788][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:38:44,450][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:38:45,109][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:38:45,769][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:38:46,428][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:38:47,087][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:38:47,745][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:38:48,405][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:38:49,065][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:38:49,724][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:38:50,384][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:38:51,045][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:38:51,707][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:38:52,368][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:38:53,027][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:38:53,687][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:38:54,347][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:38:55,006][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:38:55,665][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:38:56,325][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:38:56,985][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:38:57,644][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:38:58,304][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:38:58,964][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:38:59,626][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:39:00,283][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:39:00,943][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:39:01,603][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:39:02,262][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:39:02,921][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:39:03,580][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:39:04,239][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:39:04,898][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:39:05,557][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:39:06,215][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:39:06,874][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:39:07,534][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:39:08,193][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:39:08,852][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:39:09,510][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:39:10,169][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:39:10,828][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:39:11,487][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:39:12,146][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:39:12,805][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:39:13,463][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:39:14,454][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:39:15,113][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:39:15,771][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:39:16,428][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:39:17,086][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:39:17,744][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:39:18,405][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:39:19,063][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:39:19,723][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:39:20,382][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:39:21,040][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:39:21,697][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:39:22,354][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:39:23,012][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:39:23,670][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:39:24,328][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:39:24,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:39:25,776][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:39:27,100][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:39:27,103][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:39:27,104][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:39:28,601][__main__][INFO] - Iteration 558 took 52s (9.61% Gen, 87.53% Train). Generation: 5s, Training: 46s. Estimated remaining time: 6h 3m 8s. Estimated total time: 14h 36m 31s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 39s, 500 more iterations: 7h 18m 15s. [2026-03-25 22:39:28,603][__main__][INFO] - Starting iteration 558. [2026-03-25 22:39:28,606][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:39:28,607][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:39:33,315][__main__][INFO] - Number of regex retries in iteration 558: 0 [2026-03-25 22:39:33,316][__main__][INFO] - agents played in iteration 558 are Bob, Alice [2026-03-25 22:39:34,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:39:34,254][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:39:34,255][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:39:34,255][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:39:34,925][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:39:35,532][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:39:36,191][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:39:36,849][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:39:37,507][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:39:38,167][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:39:38,825][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:39:39,485][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:39:40,144][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:39:40,801][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:39:41,460][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:39:42,120][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:39:42,778][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:39:43,436][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:39:44,095][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:39:44,753][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:39:45,410][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:39:46,067][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:39:46,726][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:39:47,382][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:39:48,041][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:39:48,699][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:39:49,358][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:39:50,016][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:39:50,674][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:39:51,333][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:39:51,991][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:39:52,649][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:39:53,308][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:39:53,966][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:39:54,625][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:39:55,282][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:39:55,940][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:39:56,598][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:39:57,255][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:39:57,914][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:39:58,572][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:39:59,230][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:39:59,888][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:40:00,546][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:40:01,207][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:40:01,866][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:40:02,524][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:40:03,182][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:40:03,840][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:40:04,498][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:40:05,156][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:40:05,814][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:40:06,798][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:40:07,457][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:40:08,116][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:40:08,773][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:40:09,430][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:40:10,090][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:40:10,748][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:40:11,406][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:40:12,064][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:40:12,722][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:40:13,380][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:40:14,037][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:40:14,695][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:40:15,353][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:40:16,012][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:40:16,670][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:40:17,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:40:18,137][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:40:19,501][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:40:19,504][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:40:19,505][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:40:20,974][__main__][INFO] - Iteration 559 took 52s (8.99% Gen, 88.20% Train). Generation: 4s, Training: 46s. Estimated remaining time: 5h 58m 33s. Estimated total time: 14h 32m 49s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 16s, 500 more iterations: 7h 16m 24s. [2026-03-25 22:40:20,976][__main__][INFO] - Starting iteration 559. [2026-03-25 22:40:20,981][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:40:20,981][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:40:26,574][__main__][INFO] - Number of regex retries in iteration 559: 0 [2026-03-25 22:40:26,575][__main__][INFO] - agents played in iteration 559 are Bob, Alice [2026-03-25 22:40:27,631][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:40:27,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:40:27,692][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:40:27,693][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:40:28,365][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:40:28,977][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:40:29,636][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:40:30,293][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:40:30,950][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:40:31,609][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:40:32,266][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:40:32,924][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:40:33,582][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:40:34,239][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:40:34,898][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:40:35,557][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:40:36,214][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:40:36,873][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:40:37,531][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:40:38,189][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:40:38,848][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:40:39,518][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:40:40,176][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:40:40,835][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:40:41,494][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:40:42,152][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:40:42,810][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:40:43,469][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:40:44,127][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:40:44,785][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:40:45,443][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:40:46,102][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:40:46,762][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:40:47,420][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:40:48,078][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:40:48,735][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:40:49,392][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:40:50,050][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:40:50,709][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:40:51,367][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:40:52,025][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:40:52,684][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:40:53,342][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:40:53,999][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:40:54,657][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:40:55,316][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:40:55,975][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:40:56,633][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:40:57,291][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:40:57,949][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:40:58,609][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:40:59,267][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:41:00,255][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:41:00,917][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:41:01,576][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:41:02,235][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:41:02,892][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:41:03,551][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:41:04,208][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:41:04,867][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:41:05,525][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:41:06,183][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:41:06,841][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:41:07,497][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:41:08,155][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:41:08,812][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:41:09,471][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:41:10,131][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:41:10,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:41:11,594][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:41:12,947][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:41:12,950][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:41:12,952][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:41:14,433][__main__][INFO] - Iteration 560 took 53s (10.46% Gen, 86.76% Train). Generation: 5s, Training: 46s. Estimated remaining time: 6h 15m 45s. Estimated total time: 14h 50m 54s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 5s, 500 more iterations: 7h 25m 27s. [2026-03-25 22:41:14,436][__main__][INFO] - Starting iteration 560. [2026-03-25 22:41:14,439][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:41:14,440][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:41:19,411][__main__][INFO] - Number of regex retries in iteration 560: 0 [2026-03-25 22:41:19,412][__main__][INFO] - agents played in iteration 560 are Bob, Alice [2026-03-25 22:41:19,921][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:41:19,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:41:19,984][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:41:19,984][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:41:20,678][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:41:21,291][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:41:21,950][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:41:22,610][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:41:23,269][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:41:23,930][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:41:24,587][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:41:25,246][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:41:25,904][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:41:26,562][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:41:27,220][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:41:27,879][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:41:28,538][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:41:29,195][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:41:29,853][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:41:30,511][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:41:31,169][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:41:31,827][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:41:32,485][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:41:33,145][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:41:33,802][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:41:34,460][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:41:35,117][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:41:35,775][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:41:36,432][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:41:37,091][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:41:37,749][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:41:38,406][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:41:39,064][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:41:39,723][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:41:40,381][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:41:41,039][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:41:41,697][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:41:42,355][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:41:43,013][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:41:43,671][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:41:44,329][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:41:44,987][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:41:45,645][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:41:46,304][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:41:46,962][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:41:47,620][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:41:48,278][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:41:48,936][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:41:49,595][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:41:50,253][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:41:50,911][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:41:51,568][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:41:52,562][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:41:53,220][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:41:53,879][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:41:54,537][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:41:55,197][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:41:55,854][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:41:56,512][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:41:57,172][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:41:57,829][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:41:58,488][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:41:59,147][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:41:59,806][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:42:00,465][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:42:01,123][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:42:01,781][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:42:02,439][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:42:03,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:42:03,880][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:42:05,217][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:42:05,220][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:42:05,222][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:42:06,703][__main__][INFO] - Iteration 561 took 52s (9.51% Gen, 87.65% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 55m 3s. Estimated total time: 14h 31m 4s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 6s, 500 more iterations: 7h 15m 32s. [2026-03-25 22:42:06,705][__main__][INFO] - Starting iteration 561. [2026-03-25 22:42:06,709][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:42:06,709][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:42:11,574][__main__][INFO] - Number of regex retries in iteration 561: 0 [2026-03-25 22:42:11,575][__main__][INFO] - agents played in iteration 561 are Bob, Alice [2026-03-25 22:42:12,122][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:42:12,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:42:12,184][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:42:12,184][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:42:12,927][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:42:13,546][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:42:14,205][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:42:14,865][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:42:15,525][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:42:16,184][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:42:16,843][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:42:17,502][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:42:18,162][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:42:18,822][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:42:19,481][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:42:20,141][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:42:20,800][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:42:21,459][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:42:22,118][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:42:22,777][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:42:23,436][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:42:24,094][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:42:24,754][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:42:25,412][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:42:26,071][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:42:26,731][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:42:27,390][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:42:28,049][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:42:28,707][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:42:29,366][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:42:30,025][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:42:30,684][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:42:31,343][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:42:32,001][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:42:32,661][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:42:33,320][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:42:33,978][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:42:34,637][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:42:35,295][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:42:35,955][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:42:36,614][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:42:37,273][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:42:37,931][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:42:38,589][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:42:39,248][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:42:39,907][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:42:40,566][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:42:41,225][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:42:41,884][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:42:42,544][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:42:43,203][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:42:43,862][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:42:44,852][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:42:45,518][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:42:46,181][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:42:46,842][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:42:47,503][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:42:48,161][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:42:48,821][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:42:49,483][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:42:50,143][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:42:50,802][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:42:51,460][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:42:52,120][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:42:52,779][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:42:53,437][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:42:54,096][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:42:54,754][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:42:55,412][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:42:56,264][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:42:57,586][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:42:57,589][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:42:57,590][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:42:59,034][__main__][INFO] - Iteration 562 took 52s (9.30% Gen, 87.94% Train). Generation: 4s, Training: 46s. Estimated remaining time: 5h 55m 13s. Estimated total time: 14h 32m 7s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 12s, 500 more iterations: 7h 16m 3s. [2026-03-25 22:42:59,037][__main__][INFO] - Starting iteration 562. [2026-03-25 22:42:59,042][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:42:59,042][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:43:06,173][__main__][INFO] - Number of regex retries in iteration 562: 0 [2026-03-25 22:43:06,174][__main__][INFO] - agents played in iteration 562 are Bob, Alice [2026-03-25 22:43:06,759][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:43:06,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:43:06,821][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:43:06,822][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:43:07,727][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:43:08,340][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:43:09,001][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:43:09,661][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:43:10,319][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:43:10,977][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:43:11,636][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:43:12,295][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:43:12,953][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:43:13,613][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:43:14,274][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:43:14,933][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:43:15,591][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:43:16,249][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:43:16,909][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:43:17,568][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:43:18,226][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:43:18,884][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:43:19,542][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:43:20,200][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:43:20,859][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:43:21,518][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:43:22,177][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:43:22,835][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:43:23,493][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:43:24,150][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:43:24,809][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:43:25,468][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:43:26,127][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:43:26,785][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:43:27,443][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:43:28,102][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:43:28,760][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:43:29,417][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:43:30,076][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:43:30,735][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:43:31,395][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:43:32,053][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:43:32,712][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:43:33,372][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:43:34,029][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:43:34,688][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:43:35,346][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:43:36,006][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:43:36,664][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:43:37,322][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:43:37,980][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:43:38,638][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:43:39,631][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:43:40,290][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:43:40,948][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:43:41,606][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:43:42,266][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:43:42,926][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:43:43,585][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:43:44,245][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:43:44,904][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:43:45,563][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:43:46,223][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:43:46,882][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:43:47,542][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:43:48,199][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:43:48,859][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:43:49,517][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:43:50,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:43:50,905][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:43:52,270][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:43:52,273][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:43:52,274][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:43:53,668][__main__][INFO] - Iteration 563 took 54s (13.06% Gen, 84.39% Train). Generation: 7s, Training: 46s. Estimated remaining time: 6h 32m 40s. Estimated total time: 15h 10m 28s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 2s, 500 more iterations: 7h 35m 14s. [2026-03-25 22:43:53,670][__main__][INFO] - Starting iteration 563. [2026-03-25 22:43:53,674][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:43:53,674][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:44:06,257][__main__][INFO] - Number of regex retries in iteration 563: 0 [2026-03-25 22:44:06,259][__main__][INFO] - agents played in iteration 563 are Bob, Alice [2026-03-25 22:44:07,333][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:44:07,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:44:07,394][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:44:07,394][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:44:08,180][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:44:08,794][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:44:09,454][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:44:10,115][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:44:10,777][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:44:11,437][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:44:12,096][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:44:12,753][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:44:13,413][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:44:14,070][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:44:14,728][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:44:15,387][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:44:16,045][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:44:16,702][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:44:17,362][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:44:18,020][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:44:18,678][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:44:19,336][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:44:19,994][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:44:20,652][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:44:21,310][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:44:21,971][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:44:22,629][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:44:23,287][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:44:23,944][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:44:24,602][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:44:25,260][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:44:25,917][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:44:26,575][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:44:27,233][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:44:27,891][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:44:28,550][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:44:29,208][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:44:29,866][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:44:30,526][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:44:31,184][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:44:31,842][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:44:32,499][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:44:33,157][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:44:33,815][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:44:34,472][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:44:35,130][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:44:35,788][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:44:36,446][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:44:37,104][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:44:37,762][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:44:38,420][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:44:39,079][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:44:40,059][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:44:40,718][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:44:41,376][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:44:42,033][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:44:42,691][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:44:43,351][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:44:44,009][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:44:44,667][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:44:45,326][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:44:45,984][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:44:46,643][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:44:47,301][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:44:47,959][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:44:48,617][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:44:49,274][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:44:49,932][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:44:50,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:44:51,357][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:44:52,701][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:44:52,703][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:44:52,705][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:44:54,112][__main__][INFO] - Iteration 564 took 1m 0s (20.82% Gen, 76.85% Train). Generation: 12s, Training: 46s. Estimated remaining time: 8h 8m 31s. Estimated total time: 16h 47m 19s. Time estimates for 10 more iterations: 10m 4s, 100 more iterations: 1h 40m 43s, 500 more iterations: 8h 23m 39s. [2026-03-25 22:44:54,114][__main__][INFO] - Starting iteration 564. [2026-03-25 22:44:54,118][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:44:54,118][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:45:00,065][__main__][INFO] - Number of regex retries in iteration 564: 0 [2026-03-25 22:45:00,067][__main__][INFO] - agents played in iteration 564 are Bob, Alice [2026-03-25 22:45:00,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:45:00,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:45:00,647][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:45:00,648][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:45:01,517][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:45:02,134][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:45:02,794][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:45:03,453][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:45:04,112][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:45:04,769][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:45:05,428][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:45:06,088][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:45:06,748][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:45:07,406][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:45:08,064][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:45:08,723][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:45:09,382][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:45:10,041][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:45:10,700][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:45:11,359][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:45:12,020][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:45:12,678][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:45:13,337][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:45:13,995][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:45:14,654][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:45:15,312][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:45:15,970][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:45:16,628][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:45:17,285][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:45:17,944][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:45:18,604][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:45:19,262][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:45:19,920][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:45:20,577][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:45:21,236][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:45:21,893][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:45:22,551][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:45:23,209][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:45:23,867][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:45:24,525][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:45:25,185][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:45:25,843][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:45:26,501][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:45:27,159][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:45:27,818][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:45:28,477][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:45:29,136][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:45:29,795][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:45:30,454][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:45:31,113][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:45:31,773][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:45:32,433][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:45:33,426][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:45:34,085][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:45:34,744][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:45:35,402][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:45:36,064][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:45:36,720][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:45:37,379][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:45:38,037][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:45:38,696][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:45:39,355][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:45:40,014][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:45:40,672][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:45:41,332][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:45:41,991][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:45:42,650][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:45:43,308][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:45:43,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:45:44,805][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:45:46,173][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:45:46,176][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:45:46,177][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:45:47,659][__main__][INFO] - Iteration 565 took 53s (11.11% Gen, 86.12% Train). Generation: 5s, Training: 46s. Estimated remaining time: 6h 12m 40s. Estimated total time: 14h 52m 23s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 14s, 500 more iterations: 7h 26m 11s. [2026-03-25 22:45:47,661][__main__][INFO] - Starting iteration 565. [2026-03-25 22:45:47,666][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:45:47,666][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:45:53,950][__main__][INFO] - Number of regex retries in iteration 565: 0 [2026-03-25 22:45:53,951][__main__][INFO] - agents played in iteration 565 are Bob, Alice [2026-03-25 22:45:54,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:45:54,500][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:45:54,501][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:45:54,501][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:45:55,260][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:45:55,877][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:45:56,539][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:45:57,199][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:45:57,860][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:45:58,519][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:45:59,183][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:45:59,844][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:46:00,502][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:46:01,160][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:46:01,820][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:46:02,479][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:46:03,138][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:46:03,796][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:46:04,455][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:46:05,113][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:46:05,771][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:46:06,429][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:46:07,088][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:46:07,746][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:46:08,404][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:46:09,062][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:46:09,720][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:46:10,378][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:46:11,038][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:46:11,696][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:46:12,357][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:46:13,017][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:46:13,676][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:46:14,334][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:46:14,995][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:46:15,654][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:46:16,314][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:46:16,972][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:46:17,632][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:46:18,291][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:46:18,950][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:46:19,608][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:46:20,268][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:46:20,927][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:46:21,585][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:46:22,244][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:46:22,902][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:46:23,561][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:46:24,219][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:46:24,877][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:46:25,535][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:46:26,193][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:46:27,192][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:46:27,850][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:46:28,509][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:46:29,168][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:46:29,828][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:46:30,487][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:46:31,146][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:46:31,805][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:46:32,463][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:46:33,123][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:46:33,781][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:46:34,439][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:46:35,099][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:46:35,758][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:46:36,416][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:46:37,073][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:46:37,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:46:38,526][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:46:39,894][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:46:39,897][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:46:39,898][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:46:41,191][__main__][INFO] - Iteration 566 took 53s (11.74% Gen, 85.84% Train). Generation: 6s, Training: 45s. Estimated remaining time: 6h 11m 32s. Estimated total time: 14h 52m 8s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 12s, 500 more iterations: 7h 26m 4s. [2026-03-25 22:46:41,194][__main__][INFO] - Starting iteration 566. [2026-03-25 22:46:41,198][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:46:41,198][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:46:46,203][__main__][INFO] - Number of regex retries in iteration 566: 0 [2026-03-25 22:46:46,204][__main__][INFO] - agents played in iteration 566 are Bob, Alice [2026-03-25 22:46:46,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:46:46,881][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:46:46,881][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:46:46,882][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:46:47,796][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:46:48,477][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:46:49,137][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:46:49,796][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:46:50,456][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:46:51,114][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:46:51,773][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:46:52,432][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:46:53,091][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:46:53,750][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:46:54,410][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:46:55,070][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:46:55,729][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:46:56,388][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:46:57,047][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:46:57,707][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:46:58,366][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:46:59,025][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:46:59,684][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:47:00,343][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:47:01,002][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:47:01,661][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:47:02,321][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:47:02,980][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:47:03,638][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:47:04,297][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:47:04,958][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:47:05,615][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:47:06,274][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:47:06,933][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:47:07,593][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:47:08,253][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:47:08,913][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:47:09,571][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:47:10,231][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:47:10,890][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:47:11,549][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:47:12,208][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:47:12,867][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:47:13,526][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:47:14,185][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:47:14,844][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:47:15,502][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:47:16,161][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:47:16,821][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:47:17,480][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:47:18,139][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:47:18,798][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:47:19,824][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:47:20,484][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:47:21,141][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:47:21,800][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:47:22,459][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:47:23,118][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:47:23,778][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:47:24,436][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:47:25,094][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:47:25,752][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:47:26,412][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:47:27,071][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:47:27,731][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:47:28,390][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:47:29,048][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:47:29,709][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:47:30,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:47:31,193][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:47:32,522][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:47:32,525][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:47:32,526][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:47:34,018][__main__][INFO] - Iteration 567 took 52s (9.48% Gen, 87.69% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 58m 53s. Estimated total time: 14h 40m 22s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 2s, 500 more iterations: 7h 20m 11s. [2026-03-25 22:47:34,022][__main__][INFO] - Starting iteration 567. [2026-03-25 22:47:34,029][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:47:34,029][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:47:39,509][__main__][INFO] - Number of regex retries in iteration 567: 0 [2026-03-25 22:47:39,510][__main__][INFO] - agents played in iteration 567 are Bob, Alice [2026-03-25 22:47:40,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:47:40,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:47:40,440][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:47:40,441][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:47:41,215][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:47:41,836][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:47:42,498][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:47:43,158][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:47:43,817][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:47:44,477][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:47:45,136][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:47:45,794][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:47:46,454][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:47:47,112][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:47:47,771][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:47:48,430][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:47:49,089][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:47:49,747][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:47:50,407][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:47:51,066][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:47:51,725][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:47:52,384][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:47:53,043][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:47:53,703][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:47:54,361][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:47:55,021][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:47:55,681][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:47:56,340][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:47:56,999][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:47:57,658][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:47:58,317][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:47:58,977][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:47:59,636][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:48:00,297][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:48:00,958][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:48:01,619][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:48:02,279][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:48:02,941][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:48:03,600][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:48:04,260][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:48:04,920][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:48:05,580][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:48:06,239][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:48:06,899][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:48:07,561][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:48:08,221][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:48:08,880][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:48:09,540][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:48:10,200][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:48:10,859][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:48:11,520][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:48:12,180][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:48:13,175][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:48:13,834][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:48:14,494][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:48:15,154][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:48:15,813][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:48:16,471][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:48:17,129][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:48:17,788][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:48:18,445][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:48:19,104][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:48:19,763][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:48:20,422][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:48:21,081][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:48:21,739][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:48:22,398][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:48:23,057][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:48:23,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:48:24,483][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:48:25,838][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:48:25,841][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:48:25,842][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:48:27,313][__main__][INFO] - Iteration 568 took 53s (10.28% Gen, 86.95% Train). Generation: 5s, Training: 46s. Estimated remaining time: 6h 5m 44s. Estimated total time: 14h 48m 6s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 48s, 500 more iterations: 7h 24m 3s. [2026-03-25 22:48:27,315][__main__][INFO] - Starting iteration 568. [2026-03-25 22:48:27,319][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:48:27,320][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:48:33,046][__main__][INFO] - Number of regex retries in iteration 568: 0 [2026-03-25 22:48:33,047][__main__][INFO] - agents played in iteration 568 are Bob, Alice [2026-03-25 22:48:33,660][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:48:33,722][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:48:33,723][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:48:33,723][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:48:34,485][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:48:35,101][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:48:35,762][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:48:36,421][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:48:37,081][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:48:37,741][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:48:38,399][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:48:39,057][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:48:39,718][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:48:40,375][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:48:41,034][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:48:41,693][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:48:42,352][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:48:43,010][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:48:43,669][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:48:44,328][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:48:44,987][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:48:45,646][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:48:46,303][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:48:46,962][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:48:47,622][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:48:48,280][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:48:48,939][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:48:49,598][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:48:50,256][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:48:50,917][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:48:51,576][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:48:52,234][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:48:52,890][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:48:53,549][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:48:54,208][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:48:54,867][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:48:55,527][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:48:56,186][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:48:56,845][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:48:57,504][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:48:58,165][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:48:58,823][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:48:59,490][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:49:00,148][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:49:00,806][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:49:01,464][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:49:02,123][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:49:02,782][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:49:03,440][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:49:04,098][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:49:04,757][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:49:05,416][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:49:06,405][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:49:07,064][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:49:07,724][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:49:08,384][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:49:09,043][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:49:09,703][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:49:10,361][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:49:11,020][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:49:11,680][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:49:12,340][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:49:13,000][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:49:13,659][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:49:14,318][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:49:14,977][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:49:15,635][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:49:16,293][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:49:16,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:49:17,738][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:49:19,085][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:49:19,087][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:49:19,089][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:49:20,475][__main__][INFO] - Iteration 569 took 53s (10.77% Gen, 86.61% Train). Generation: 5s, Training: 46s. Estimated remaining time: 6h 2m 42s. Estimated total time: 14h 45m 57s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 35s, 500 more iterations: 7h 22m 58s. [2026-03-25 22:49:20,477][__main__][INFO] - Starting iteration 569. [2026-03-25 22:49:20,481][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:49:20,482][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:49:25,566][__main__][INFO] - Number of regex retries in iteration 569: 0 [2026-03-25 22:49:25,567][__main__][INFO] - agents played in iteration 569 are Bob, Alice [2026-03-25 22:49:26,070][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:49:26,132][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:49:26,133][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:49:26,133][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:49:26,802][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:49:27,426][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:49:28,088][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:49:28,747][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:49:29,406][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:49:30,068][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:49:30,728][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:49:31,388][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:49:32,047][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:49:32,711][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:49:33,369][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:49:34,030][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:49:34,690][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:49:35,350][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:49:36,009][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:49:36,669][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:49:37,327][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:49:37,986][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:49:38,646][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:49:39,305][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:49:39,964][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:49:40,625][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:49:41,283][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:49:41,942][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:49:42,600][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:49:43,260][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:49:43,918][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:49:44,577][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:49:45,238][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:49:45,897][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:49:46,557][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:49:47,215][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:49:47,874][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:49:48,533][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:49:49,193][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:49:49,851][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:49:50,513][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:49:51,172][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:49:51,833][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:49:52,494][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:49:53,153][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:49:53,813][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:49:54,472][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:49:55,131][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:49:55,790][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:49:56,449][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:49:57,108][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:49:57,767][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:49:58,764][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:49:59,423][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:50:00,081][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:50:00,739][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:50:01,399][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:50:02,057][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:50:02,716][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:50:03,373][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:50:04,033][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:50:04,690][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:50:05,351][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:50:06,008][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:50:06,666][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:50:07,325][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:50:07,984][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:50:08,642][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:50:09,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:50:10,098][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:50:11,505][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:50:11,508][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:50:11,510][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:50:12,935][__main__][INFO] - Iteration 570 took 52s (9.69% Gen, 87.58% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 50m 8s. Estimated total time: 14h 34m 15s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 25s, 500 more iterations: 7h 17m 7s. [2026-03-25 22:50:12,937][__main__][INFO] - Starting iteration 570. [2026-03-25 22:50:12,941][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:50:12,942][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:50:18,489][__main__][INFO] - Number of regex retries in iteration 570: 0 [2026-03-25 22:50:18,491][__main__][INFO] - agents played in iteration 570 are Bob, Alice [2026-03-25 22:50:18,976][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:50:19,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:50:19,039][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:50:19,040][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:50:19,708][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:50:20,328][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:50:20,991][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:50:21,655][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:50:22,315][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:50:22,976][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:50:23,636][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:50:24,295][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:50:24,955][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:50:25,615][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:50:26,274][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:50:26,934][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:50:27,593][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:50:28,253][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:50:28,911][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:50:29,570][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:50:30,229][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:50:30,890][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:50:31,550][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:50:32,209][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:50:32,870][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:50:33,532][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:50:34,194][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:50:34,857][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:50:35,516][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:50:36,177][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:50:36,838][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:50:37,499][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:50:38,159][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:50:38,822][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:50:39,483][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:50:40,144][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:50:40,803][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:50:41,464][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:50:42,123][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:50:42,782][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:50:43,441][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:50:44,100][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:50:44,758][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:50:45,416][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:50:46,074][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:50:46,733][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:50:47,392][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:50:48,052][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:50:48,711][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:50:49,370][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:50:50,028][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:50:50,688][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:50:51,676][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:50:52,334][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:50:52,992][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:50:53,649][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:50:54,307][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:50:54,965][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:50:55,624][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:50:56,282][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:50:56,941][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:50:57,599][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:50:58,257][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:50:58,916][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:50:59,576][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:51:00,234][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:51:00,894][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:51:01,553][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:51:02,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:51:03,133][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:51:04,571][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:51:04,573][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:51:04,575][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:51:06,069][__main__][INFO] - Iteration 571 took 53s (10.44% Gen, 86.74% Train). Generation: 5s, Training: 46s. Estimated remaining time: 6h 0m 29s. Estimated total time: 14h 45m 30s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 33s, 500 more iterations: 7h 22m 45s. [2026-03-25 22:51:06,071][__main__][INFO] - Starting iteration 571. [2026-03-25 22:51:06,074][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:51:06,075][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:51:11,123][__main__][INFO] - Number of regex retries in iteration 571: 0 [2026-03-25 22:51:11,124][__main__][INFO] - agents played in iteration 571 are Bob, Alice [2026-03-25 22:51:11,730][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:51:11,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:51:11,792][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:51:11,793][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:51:12,459][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:51:13,375][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:51:13,981][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:51:14,639][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:51:15,299][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:51:15,959][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:51:16,617][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:51:17,276][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:51:17,935][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:51:18,593][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:51:19,251][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:51:19,912][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:51:20,570][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:51:21,228][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:51:21,887][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:51:22,546][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:51:23,204][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:51:23,862][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:51:24,520][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:51:25,178][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:51:25,836][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:51:26,495][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:51:27,154][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:51:27,812][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:51:28,470][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:51:29,129][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:51:29,789][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:51:30,444][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:51:31,102][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:51:31,760][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:51:32,419][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:51:33,077][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:51:33,736][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:51:34,394][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:51:35,052][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:51:35,712][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:51:36,372][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:51:37,031][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:51:37,691][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:51:38,349][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:51:39,010][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:51:39,670][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:51:40,330][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:51:40,989][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:51:41,648][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:51:42,307][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:51:42,970][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:51:43,631][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:51:44,639][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:51:45,300][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:51:45,959][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:51:46,618][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:51:47,278][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:51:47,937][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:51:48,597][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:51:49,255][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:51:49,915][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:51:50,573][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:51:51,233][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:51:51,892][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:51:52,551][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:51:53,210][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:51:53,869][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:51:54,534][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:51:55,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:51:55,985][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:51:57,312][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:51:57,315][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:51:57,316][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:51:58,844][__main__][INFO] - Iteration 572 took 52s (9.57% Gen, 87.53% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 53m 37s. Estimated total time: 14h 39m 30s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 57s, 500 more iterations: 7h 19m 45s. [2026-03-25 22:51:58,846][__main__][INFO] - Starting iteration 572. [2026-03-25 22:51:58,851][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:51:58,851][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:52:05,335][__main__][INFO] - Number of regex retries in iteration 572: 0 [2026-03-25 22:52:05,337][__main__][INFO] - agents played in iteration 572 are Bob, Alice [2026-03-25 22:52:06,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:52:06,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:52:06,328][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:52:06,328][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:52:07,019][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:52:07,626][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:52:08,286][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:52:08,947][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:52:09,607][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:52:10,267][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:52:10,925][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:52:11,584][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:52:12,243][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:52:12,901][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:52:13,560][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:52:14,219][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:52:14,877][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:52:15,536][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:52:16,195][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:52:16,853][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:52:17,512][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:52:18,173][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:52:18,831][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:52:19,491][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:52:20,149][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:52:20,807][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:52:21,466][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:52:22,125][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:52:22,783][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:52:23,442][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:52:24,101][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:52:24,759][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:52:25,417][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:52:26,077][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:52:26,735][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:52:27,393][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:52:28,051][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:52:28,710][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:52:29,369][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:52:30,033][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:52:30,692][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:52:31,351][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:52:32,009][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:52:32,667][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:52:33,325][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:52:33,983][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:52:34,642][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:52:35,300][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:52:35,958][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:52:36,616][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:52:37,275][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:52:37,933][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:52:38,922][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:52:39,583][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:52:40,243][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:52:40,900][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:52:41,559][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:52:42,218][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:52:42,878][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:52:43,536][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:52:44,194][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:52:44,854][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:52:45,512][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:52:46,171][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:52:46,832][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:52:47,491][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:52:48,150][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:52:48,808][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:52:49,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:52:50,247][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:52:51,583][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:52:51,586][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:52:51,587][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:52:53,055][__main__][INFO] - Iteration 573 took 54s (11.96% Gen, 85.32% Train). Generation: 6s, Training: 46s. Estimated remaining time: 6h 16m 38s. Estimated total time: 15h 3m 26s. Time estimates for 10 more iterations: 9m 2s, 100 more iterations: 1h 30m 20s, 500 more iterations: 7h 31m 43s. [2026-03-25 22:52:53,057][__main__][INFO] - Starting iteration 573. [2026-03-25 22:52:53,061][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:52:53,061][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:52:57,876][__main__][INFO] - Number of regex retries in iteration 573: 0 [2026-03-25 22:52:57,877][__main__][INFO] - agents played in iteration 573 are Bob, Alice [2026-03-25 22:52:58,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:52:58,442][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:52:58,443][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:52:58,444][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:52:59,172][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:52:59,777][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:53:00,437][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:53:01,095][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:53:01,756][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:53:02,421][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:53:03,081][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:53:03,739][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:53:04,400][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:53:05,058][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:53:05,716][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:53:06,374][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:53:07,033][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:53:07,691][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:53:08,350][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:53:09,008][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:53:09,667][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:53:10,325][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:53:10,984][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:53:11,642][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:53:12,300][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:53:12,958][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:53:13,616][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:53:14,275][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:53:14,933][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:53:15,591][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:53:16,249][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:53:16,908][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:53:17,567][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:53:18,226][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:53:18,885][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:53:19,544][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:53:20,202][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:53:20,863][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:53:21,522][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:53:22,180][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:53:22,839][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:53:23,499][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:53:24,157][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:53:24,816][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:53:25,476][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:53:26,139][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:53:26,797][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:53:27,456][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:53:28,116][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:53:28,775][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:53:29,435][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:53:30,095][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:53:31,076][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:53:31,738][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:53:32,396][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:53:33,055][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:53:33,716][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:53:34,375][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:53:35,033][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:53:35,692][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:53:36,353][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:53:37,019][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:53:37,678][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:53:38,336][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:53:38,995][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:53:39,654][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:53:40,313][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:53:40,971][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:53:41,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:53:42,419][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:53:43,808][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:53:43,810][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:53:43,812][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:53:45,354][__main__][INFO] - Iteration 574 took 52s (9.21% Gen, 87.84% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 43m 55s. Estimated total time: 14h 31m 35s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 9s, 500 more iterations: 7h 15m 47s. [2026-03-25 22:53:45,356][__main__][INFO] - Starting iteration 574. [2026-03-25 22:53:45,360][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:53:45,390][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:53:50,322][__main__][INFO] - Number of regex retries in iteration 574: 0 [2026-03-25 22:53:50,324][__main__][INFO] - agents played in iteration 574 are Bob, Alice [2026-03-25 22:53:50,915][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:53:50,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:53:50,986][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:53:50,987][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:53:51,860][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:53:52,475][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:53:53,138][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:53:53,799][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:53:54,458][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:53:55,119][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:53:55,777][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:53:56,436][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:53:57,096][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:53:57,754][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:53:58,415][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:53:59,075][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:53:59,733][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:54:00,396][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:54:01,056][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:54:01,714][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:54:02,373][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:54:03,033][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:54:03,692][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:54:04,351][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:54:05,010][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:54:05,669][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:54:06,329][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:54:06,988][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:54:07,647][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:54:08,306][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:54:08,966][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:54:09,625][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:54:10,285][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:54:10,944][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:54:11,603][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:54:12,263][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:54:12,923][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:54:13,583][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:54:14,242][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:54:14,901][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:54:15,560][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:54:16,220][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:54:16,879][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:54:17,538][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:54:18,197][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:54:18,856][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:54:19,514][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:54:20,173][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:54:20,831][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:54:21,490][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:54:22,149][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:54:22,808][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:54:23,806][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:54:24,466][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:54:25,124][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:54:25,783][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:54:26,440][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:54:27,099][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:54:27,756][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:54:28,414][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:54:29,073][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:54:29,731][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:54:30,389][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:54:31,050][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:54:31,710][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:54:32,370][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:54:33,029][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:54:33,689][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:54:34,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:54:35,216][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:54:36,566][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:54:36,569][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:54:36,570][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:54:38,048][__main__][INFO] - Iteration 575 took 52s (9.36% Gen, 87.77% Train). Generation: 4s, Training: 46s. Estimated remaining time: 5h 49m 37s. Estimated total time: 14h 38m 10s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 49s, 500 more iterations: 7h 19m 5s. [2026-03-25 22:54:38,051][__main__][INFO] - Starting iteration 575. [2026-03-25 22:54:38,055][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:54:38,055][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:54:42,803][__main__][INFO] - Number of regex retries in iteration 575: 0 [2026-03-25 22:54:42,804][__main__][INFO] - agents played in iteration 575 are Bob, Alice [2026-03-25 22:54:43,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:54:43,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:54:43,402][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:54:43,402][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:54:44,082][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:54:44,700][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:54:45,360][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:54:46,019][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:54:46,679][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:54:47,338][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:54:47,997][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:54:48,657][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:54:49,317][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:54:49,976][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:54:50,635][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:54:51,294][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:54:51,954][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:54:52,614][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:54:53,273][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:54:53,933][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:54:54,593][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:54:55,253][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:54:55,912][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:54:56,571][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:54:57,230][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:54:57,889][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:54:58,549][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:54:59,212][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:54:59,871][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:55:00,530][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:55:01,190][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:55:01,850][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:55:02,509][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:55:03,168][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:55:03,827][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:55:04,486][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:55:05,144][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:55:05,804][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:55:06,464][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:55:07,124][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:55:07,784][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:55:08,443][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:55:09,102][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:55:09,763][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:55:10,423][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:55:11,081][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:55:11,740][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:55:12,400][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:55:13,059][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:55:13,717][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:55:14,376][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:55:15,038][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:55:16,025][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:55:16,684][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:55:17,343][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:55:18,002][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:55:18,661][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:55:19,320][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:55:19,978][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:55:20,638][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:55:21,299][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:55:21,958][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:55:22,616][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:55:23,274][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:55:23,932][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:55:24,591][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:55:25,250][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:55:25,908][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:55:26,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:55:27,345][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:55:28,681][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:55:28,684][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:55:28,686][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:55:30,133][__main__][INFO] - Iteration 576 took 52s (9.12% Gen, 88.10% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 38m 34s. Estimated total time: 14h 27m 59s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 47s, 500 more iterations: 7h 13m 59s. [2026-03-25 22:55:30,136][__main__][INFO] - Starting iteration 576. [2026-03-25 22:55:30,140][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:55:30,140][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:55:35,714][__main__][INFO] - Number of regex retries in iteration 576: 0 [2026-03-25 22:55:35,715][__main__][INFO] - agents played in iteration 576 are Bob, Alice [2026-03-25 22:55:36,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:55:36,550][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:55:36,551][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:55:36,552][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:55:37,376][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:55:37,995][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:55:38,656][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:55:39,314][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:55:39,972][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:55:40,630][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:55:41,290][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:55:41,948][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:55:42,608][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:55:43,266][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:55:43,926][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:55:44,585][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:55:45,244][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:55:45,902][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:55:46,561][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:55:47,224][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:55:47,882][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:55:48,540][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:55:49,197][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:55:49,856][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:55:50,515][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:55:51,173][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:55:51,830][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:55:52,489][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:55:53,147][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:55:53,805][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:55:54,463][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:55:55,122][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:55:55,779][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:55:56,437][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:55:57,095][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:55:57,753][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:55:58,410][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:55:59,068][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:55:59,726][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:56:00,385][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:56:01,043][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:56:01,701][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:56:02,359][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:56:03,017][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:56:03,676][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:56:04,334][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:56:04,992][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:56:05,650][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:56:06,308][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:56:06,966][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:56:07,623][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:56:08,281][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:56:09,271][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:56:09,933][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:56:10,592][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:56:11,251][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:56:11,910][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:56:12,568][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:56:13,226][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:56:13,886][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:56:14,546][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:56:15,203][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:56:15,861][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:56:16,520][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:56:17,179][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:56:17,838][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:56:18,497][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:56:19,155][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:56:19,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:56:20,603][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:56:21,946][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:56:21,948][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:56:21,950][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:56:23,394][__main__][INFO] - Iteration 577 took 53s (10.47% Gen, 86.81% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 57m 18s. Estimated total time: 14h 47m 36s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 45s, 500 more iterations: 7h 23m 48s. [2026-03-25 22:56:23,397][__main__][INFO] - Starting iteration 577. [2026-03-25 22:56:23,401][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:56:23,402][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:56:32,229][__main__][INFO] - Number of regex retries in iteration 577: 0 [2026-03-25 22:56:32,231][__main__][INFO] - agents played in iteration 577 are Bob, Alice [2026-03-25 22:56:32,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:56:32,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:56:32,858][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:56:32,859][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:56:33,522][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:56:34,135][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:56:34,793][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:56:35,454][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:56:36,113][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:56:36,773][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:56:37,432][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:56:38,089][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:56:38,748][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:56:39,407][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:56:40,066][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:56:40,723][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:56:41,381][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:56:42,040][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:56:42,698][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:56:43,356][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:56:44,014][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:56:44,673][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:56:45,331][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:56:45,989][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:56:46,647][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:56:47,306][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:56:47,964][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:56:48,622][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:56:49,279][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:56:49,938][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:56:50,596][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:56:51,254][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:56:51,912][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:56:52,570][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:56:53,227][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:56:53,885][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:56:54,544][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:56:55,203][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:56:55,862][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:56:56,521][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:56:57,179][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:56:57,837][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:56:58,496][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:56:59,154][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:56:59,813][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:57:00,471][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:57:01,130][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:57:01,788][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:57:02,446][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:57:03,105][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:57:03,762][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:57:04,422][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:57:05,407][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:57:06,066][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:57:06,725][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:57:07,385][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:57:08,043][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:57:08,701][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:57:09,361][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:57:10,020][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:57:10,679][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:57:11,339][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:57:12,000][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:57:12,659][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:57:13,320][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:57:13,979][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:57:14,638][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:57:15,297][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:57:15,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:57:16,737][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:57:18,132][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:57:18,136][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:57:18,137][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:57:19,454][__main__][INFO] - Iteration 578 took 56s (15.75% Gen, 81.89% Train). Generation: 8s, Training: 45s. Estimated remaining time: 6h 43m 1s. Estimated total time: 15h 34m 15s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 25s, 500 more iterations: 7h 47m 7s. [2026-03-25 22:57:19,457][__main__][INFO] - Starting iteration 578. [2026-03-25 22:57:19,461][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:57:19,461][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:57:25,607][__main__][INFO] - Number of regex retries in iteration 578: 0 [2026-03-25 22:57:25,607][__main__][INFO] - agents played in iteration 578 are Bob, Alice [2026-03-25 22:57:26,126][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:57:26,188][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:57:26,189][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:57:26,189][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:57:26,868][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:57:27,491][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:57:28,147][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:57:28,807][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:57:29,467][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:57:30,127][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:57:30,787][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:57:31,446][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:57:32,104][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:57:32,762][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:57:33,421][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:57:34,079][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:57:34,739][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:57:35,398][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:57:36,055][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:57:36,713][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:57:37,372][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:57:38,032][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:57:38,690][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:57:39,348][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:57:40,007][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:57:40,665][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:57:41,323][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:57:41,981][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:57:42,639][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:57:43,296][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:57:43,955][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:57:44,614][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:57:45,272][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:57:45,932][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:57:46,594][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:57:47,254][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:57:47,912][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:57:48,571][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:57:49,230][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:57:49,889][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:57:50,547][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:57:51,205][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:57:51,864][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:57:52,523][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:57:53,182][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:57:53,841][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:57:54,501][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:57:55,161][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:57:55,820][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:57:56,479][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:57:57,138][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:57:57,796][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:57:58,781][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:57:59,447][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:58:00,106][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:58:00,765][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:58:01,424][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:58:02,082][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:58:02,740][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:58:03,399][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:58:04,058][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:58:04,716][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:58:05,376][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:58:06,034][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:58:06,693][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:58:07,351][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:58:08,010][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:58:08,669][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:58:09,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:58:10,156][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:58:11,506][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:58:11,509][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:58:11,510][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:58:12,953][__main__][INFO] - Iteration 579 took 53s (11.49% Gen, 85.81% Train). Generation: 6s, Training: 45s. Estimated remaining time: 5h 59m 25s. Estimated total time: 14h 51m 33s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 9s, 500 more iterations: 7h 25m 46s. [2026-03-25 22:58:12,955][__main__][INFO] - Starting iteration 579. [2026-03-25 22:58:12,959][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:58:12,959][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:58:18,509][__main__][INFO] - Number of regex retries in iteration 579: 0 [2026-03-25 22:58:18,510][__main__][INFO] - agents played in iteration 579 are Bob, Alice [2026-03-25 22:58:19,013][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:58:19,075][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:58:19,076][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:58:19,077][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:58:19,746][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:58:20,363][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:58:21,022][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:58:21,680][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:58:22,338][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:58:22,996][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:58:23,654][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:58:24,313][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:58:24,972][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:58:25,629][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:58:26,287][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:58:26,946][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:58:27,673][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:58:28,329][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:58:28,988][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:58:29,648][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:58:30,306][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:58:30,964][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:58:31,623][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:58:32,282][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:58:32,940][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:58:33,600][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:58:34,259][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:58:34,918][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:58:35,577][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:58:36,235][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:58:36,894][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:58:37,553][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:58:38,212][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:58:38,871][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:58:39,530][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:58:40,189][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:58:40,847][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:58:41,505][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:58:42,163][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:58:42,821][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:58:43,480][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:58:44,138][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:58:44,796][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:58:45,455][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:58:46,113][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:58:46,771][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:58:47,429][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:58:48,087][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:58:48,745][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:58:49,403][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:58:50,066][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:58:50,724][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:58:51,725][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:58:52,384][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:58:53,042][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:58:53,700][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:58:54,360][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:58:55,020][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:58:55,679][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:58:56,338][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:58:56,996][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:58:57,654][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:58:58,315][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:58:58,974][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:58:59,632][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:59:00,290][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:59:00,948][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:59:01,607][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:59:02,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:59:03,067][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:59:04,408][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:59:04,410][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:59:04,412][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:59:05,831][__main__][INFO] - Iteration 580 took 52s (10.50% Gen, 86.81% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 48m 13s. Estimated total time: 14h 41m 13s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 7s, 500 more iterations: 7h 20m 36s. [2026-03-25 22:59:05,833][__main__][INFO] - Starting iteration 580. [2026-03-25 22:59:05,842][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:59:05,843][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:59:10,552][__main__][INFO] - Number of regex retries in iteration 580: 0 [2026-03-25 22:59:10,552][__main__][INFO] - agents played in iteration 580 are Bob, Alice [2026-03-25 22:59:11,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:59:11,120][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 22:59:11,121][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:59:11,121][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:59:11,792][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:59:12,409][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:59:13,070][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:59:13,729][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:59:14,389][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:59:15,048][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:59:15,709][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:59:16,370][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:59:17,030][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:59:17,691][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:59:18,353][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:59:19,013][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:59:19,673][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:59:20,333][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:59:20,992][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:59:21,654][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:59:22,315][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:59:22,976][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:59:23,636][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:59:24,296][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:59:24,956][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:59:25,616][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:59:26,276][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:59:26,937][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:59:27,597][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:59:28,259][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:59:28,921][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:59:29,580][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:59:30,240][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:59:30,901][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:59:31,563][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:59:32,222][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:59:32,880][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:59:33,539][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:59:34,199][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:59:34,860][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:59:35,520][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:59:36,181][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:59:36,840][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:59:37,500][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:59:38,160][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:59:38,819][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:59:39,479][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:59:40,138][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:59:40,798][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:59:41,458][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:59:42,119][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:59:42,778][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:59:43,765][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:59:44,425][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:59:45,088][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:59:45,747][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:59:46,407][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:59:47,066][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:59:47,726][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:59:48,384][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:59:49,044][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:59:49,702][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:59:50,361][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:59:51,020][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:59:51,678][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:59:52,336][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:59:52,995][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:59:53,653][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:59:54,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:59:55,100][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 22:59:56,444][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:59:56,446][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:59:56,448][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:59:57,854][__main__][INFO] - Iteration 581 took 52s (9.05% Gen, 88.24% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 33m 1s. Estimated total time: 14h 26m 54s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 41s, 500 more iterations: 7h 13m 27s. [2026-03-25 22:59:57,856][__main__][INFO] - Starting iteration 581. [2026-03-25 22:59:57,860][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 22:59:57,860][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:00:07,905][__main__][INFO] - Number of regex retries in iteration 581: 0 [2026-03-25 23:00:07,906][__main__][INFO] - agents played in iteration 581 are Bob, Alice [2026-03-25 23:00:08,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:00:09,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:00:09,005][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:00:09,006][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:00:09,808][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:00:10,423][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:00:11,082][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:00:11,741][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:00:12,399][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:00:13,058][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:00:13,716][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:00:14,374][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:00:15,032][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:00:15,697][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:00:16,356][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:00:17,014][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:00:17,673][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:00:18,333][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:00:18,991][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:00:19,649][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:00:20,308][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:00:20,966][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:00:21,625][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:00:22,283][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:00:22,943][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:00:23,604][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:00:24,263][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:00:24,922][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:00:25,581][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:00:26,239][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:00:26,899][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:00:27,558][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:00:28,217][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:00:28,876][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:00:29,534][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:00:30,192][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:00:30,850][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:00:31,509][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:00:32,171][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:00:32,831][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:00:33,489][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:00:34,148][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:00:34,807][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:00:35,465][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:00:36,123][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:00:36,782][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:00:37,441][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:00:38,099][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:00:38,757][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:00:39,416][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:00:40,076][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:00:40,734][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:00:41,727][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:00:42,386][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:00:43,046][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:00:43,705][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:00:44,363][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:00:45,021][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:00:45,680][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:00:46,341][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:00:47,006][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:00:47,667][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:00:48,326][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:00:48,985][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:00:49,644][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:00:50,304][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:00:50,963][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:00:51,622][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:00:52,281][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:00:53,066][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:00:54,397][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:00:54,399][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:00:54,401][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:00:55,883][__main__][INFO] - Iteration 582 took 58s (17.31% Gen, 80.13% Train). Generation: 10s, Training: 46s. Estimated remaining time: 7h 12m 14s. Estimated total time: 16h 7m 4s. Time estimates for 10 more iterations: 9m 40s, 100 more iterations: 1h 36m 42s, 500 more iterations: 8h 3m 32s. [2026-03-25 23:00:55,885][__main__][INFO] - Starting iteration 582. [2026-03-25 23:00:55,889][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:00:55,889][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:01:01,161][__main__][INFO] - Number of regex retries in iteration 582: 0 [2026-03-25 23:01:01,162][__main__][INFO] - agents played in iteration 582 are Bob, Alice [2026-03-25 23:01:01,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:01:01,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:01:01,763][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:01:01,763][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:01:02,424][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:01:03,032][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:01:03,691][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:01:04,350][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:01:05,009][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:01:05,667][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:01:06,325][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:01:06,983][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:01:07,642][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:01:08,300][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:01:08,957][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:01:09,615][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:01:10,275][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:01:10,934][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:01:11,592][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:01:12,249][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:01:12,907][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:01:13,564][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:01:14,222][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:01:14,881][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:01:15,540][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:01:16,198][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:01:16,857][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:01:17,516][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:01:18,174][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:01:18,833][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:01:19,492][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:01:20,150][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:01:20,808][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:01:21,467][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:01:22,125][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:01:22,783][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:01:23,441][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:01:24,100][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:01:24,758][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:01:25,416][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:01:26,073][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:01:26,732][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:01:27,390][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:01:28,048][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:01:28,707][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:01:29,366][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:01:30,024][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:01:30,682][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:01:31,341][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:01:32,000][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:01:32,659][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:01:33,318][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:01:34,301][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:01:34,960][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:01:35,618][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:01:36,277][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:01:36,938][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:01:37,597][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:01:38,255][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:01:38,913][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:01:39,572][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:01:40,231][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:01:40,889][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:01:41,547][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:01:42,206][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:01:42,869][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:01:43,527][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:01:44,187][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:01:44,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:01:45,664][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:01:47,003][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:01:47,005][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:01:47,006][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:01:48,456][__main__][INFO] - Iteration 583 took 52s (10.03% Gen, 87.21% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 40m 26s. Estimated total time: 14h 36m 9s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 36s, 500 more iterations: 7h 18m 4s. [2026-03-25 23:01:48,458][__main__][INFO] - Starting iteration 583. [2026-03-25 23:01:48,462][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:01:48,463][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:01:53,183][__main__][INFO] - Number of regex retries in iteration 583: 0 [2026-03-25 23:01:53,185][__main__][INFO] - agents played in iteration 583 are Bob, Alice [2026-03-25 23:01:53,672][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:01:53,735][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:01:53,736][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:01:53,736][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:01:54,515][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:01:55,130][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:01:55,793][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:01:56,452][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:01:57,113][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:01:57,772][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:01:58,432][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:01:59,089][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:01:59,748][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:02:00,406][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:02:01,068][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:02:01,726][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:02:02,386][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:02:03,045][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:02:03,704][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:02:04,362][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:02:05,020][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:02:05,680][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:02:06,338][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:02:06,996][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:02:07,656][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:02:08,314][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:02:08,973][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:02:09,631][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:02:10,290][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:02:10,950][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:02:11,609][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:02:12,268][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:02:12,926][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:02:13,585][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:02:14,243][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:02:14,902][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:02:15,561][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:02:16,220][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:02:16,879][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:02:17,541][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:02:18,202][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:02:18,860][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:02:19,518][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:02:20,177][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:02:20,836][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:02:21,499][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:02:22,157][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:02:22,815][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:02:23,473][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:02:24,131][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:02:24,789][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:02:25,447][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:02:26,456][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:02:27,115][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:02:27,775][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:02:28,434][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:02:29,094][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:02:29,753][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:02:30,412][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:02:31,071][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:02:31,729][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:02:32,387][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:02:33,046][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:02:33,705][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:02:34,363][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:02:35,021][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:02:35,680][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:02:36,339][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:02:36,997][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:02:37,778][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:02:39,141][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:02:39,144][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:02:39,145][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:02:40,619][__main__][INFO] - Iteration 584 took 52s (9.05% Gen, 88.12% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 32m 43s. Estimated total time: 14h 29m 18s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 55s, 500 more iterations: 7h 14m 39s. [2026-03-25 23:02:40,621][__main__][INFO] - Starting iteration 584. [2026-03-25 23:02:40,624][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:02:40,625][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:02:48,301][__main__][INFO] - Number of regex retries in iteration 584: 0 [2026-03-25 23:02:48,303][__main__][INFO] - agents played in iteration 584 are Bob, Alice [2026-03-25 23:02:48,793][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:02:48,855][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:02:48,855][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:02:48,856][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:02:49,545][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:02:50,149][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:02:50,809][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:02:51,469][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:02:52,128][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:02:52,786][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:02:53,445][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:02:54,106][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:02:54,766][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:02:55,424][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:02:56,084][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:02:56,743][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:02:57,402][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:02:58,060][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:02:58,718][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:02:59,377][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:03:00,035][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:03:00,693][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:03:01,352][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:03:02,011][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:03:02,671][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:03:03,330][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:03:03,988][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:03:04,647][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:03:05,305][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:03:05,963][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:03:06,622][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:03:07,280][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:03:07,945][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:03:08,604][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:03:09,262][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:03:09,921][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:03:10,580][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:03:11,239][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:03:11,897][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:03:12,554][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:03:13,212][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:03:13,871][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:03:14,528][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:03:15,186][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:03:15,844][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:03:16,502][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:03:17,160][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:03:17,818][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:03:18,476][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:03:19,134][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:03:19,792][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:03:20,450][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:03:21,434][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:03:22,093][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:03:22,751][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:03:23,409][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:03:24,068][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:03:24,727][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:03:25,385][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:03:26,043][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:03:26,702][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:03:27,359][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:03:28,017][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:03:28,676][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:03:29,334][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:03:29,993][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:03:30,651][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:03:31,308][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:03:31,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:03:32,717][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:03:34,033][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:03:34,037][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:03:34,038][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:03:35,430][__main__][INFO] - Iteration 585 took 54s (14.01% Gen, 83.45% Train). Generation: 7s, Training: 45s. Estimated remaining time: 6h 15m 57s. Estimated total time: 15h 13m 27s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 20s, 500 more iterations: 7h 36m 43s. [2026-03-25 23:03:35,432][__main__][INFO] - Starting iteration 585. [2026-03-25 23:03:35,436][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:03:35,436][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:03:40,636][__main__][INFO] - Number of regex retries in iteration 585: 0 [2026-03-25 23:03:40,637][__main__][INFO] - agents played in iteration 585 are Bob, Alice [2026-03-25 23:03:41,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:03:41,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:03:41,767][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:03:41,767][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:03:42,589][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:03:43,205][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:03:43,866][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:03:44,525][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:03:45,184][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:03:45,843][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:03:46,502][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:03:47,161][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:03:47,821][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:03:48,480][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:03:49,138][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:03:49,798][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:03:50,457][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:03:51,116][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:03:51,775][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:03:52,433][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:03:53,090][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:03:53,749][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:03:54,407][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:03:55,066][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:03:55,725][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:03:56,385][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:03:57,044][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:03:57,701][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:03:58,360][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:03:59,019][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:03:59,678][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:04:00,337][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:04:00,997][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:04:01,655][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:04:02,315][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:04:02,973][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:04:03,632][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:04:04,291][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:04:04,949][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:04:05,608][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:04:06,266][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:04:06,924][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:04:07,582][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:04:08,240][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:04:08,898][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:04:09,556][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:04:10,215][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:04:10,873][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:04:11,531][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:04:12,189][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:04:12,848][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:04:13,507][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:04:14,515][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:04:15,175][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:04:15,834][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:04:16,493][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:04:17,152][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:04:17,810][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:04:18,468][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:04:19,127][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:04:19,786][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:04:20,445][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:04:21,103][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:04:21,761][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:04:22,421][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:04:23,080][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:04:23,739][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:04:24,399][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:04:25,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:04:25,876][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:04:27,197][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:04:27,200][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:04:27,267][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:04:28,672][__main__][INFO] - Iteration 586 took 53s (9.77% Gen, 87.59% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 48m 55s. Estimated total time: 14h 47m 18s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 43s, 500 more iterations: 7h 23m 39s. [2026-03-25 23:04:28,674][__main__][INFO] - Starting iteration 586. [2026-03-25 23:04:28,680][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:04:28,680][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:04:34,500][__main__][INFO] - Number of regex retries in iteration 586: 0 [2026-03-25 23:04:34,501][__main__][INFO] - agents played in iteration 586 are Bob, Alice [2026-03-25 23:04:35,063][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:04:35,125][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:04:35,126][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:04:35,126][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:04:35,779][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:04:36,387][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:04:37,048][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:04:37,707][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:04:38,368][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:04:39,026][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:04:39,685][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:04:40,343][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:04:41,002][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:04:41,661][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:04:42,319][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:04:42,978][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:04:43,636][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:04:44,295][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:04:44,954][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:04:45,611][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:04:46,269][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:04:46,928][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:04:47,586][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:04:48,244][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:04:48,905][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:04:49,563][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:04:50,221][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:04:50,881][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:04:51,539][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:04:52,196][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:04:52,854][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:04:53,511][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:04:54,169][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:04:54,829][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:04:55,486][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:04:56,145][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:04:56,803][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:04:57,460][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:04:58,119][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:04:58,778][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:04:59,436][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:05:00,095][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:05:00,753][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:05:01,412][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:05:02,070][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:05:02,728][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:05:03,385][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:05:04,043][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:05:04,702][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:05:05,361][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:05:06,019][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:05:06,678][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:05:07,690][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:05:08,350][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:05:09,009][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:05:09,668][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:05:10,328][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:05:10,988][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:05:11,649][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:05:12,309][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:05:12,967][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:05:13,626][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:05:14,284][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:05:14,942][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:05:15,601][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:05:16,259][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:05:16,917][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:05:17,576][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:05:18,234][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:05:19,039][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:05:20,381][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:05:20,384][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:05:20,385][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:05:21,760][__main__][INFO] - Iteration 587 took 53s (10.97% Gen, 86.44% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 45m 26s. Estimated total time: 14h 44m 42s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 28s, 500 more iterations: 7h 22m 21s. [2026-03-25 23:05:21,762][__main__][INFO] - Starting iteration 587. [2026-03-25 23:05:21,765][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:05:21,766][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:05:31,804][__main__][INFO] - Number of regex retries in iteration 587: 0 [2026-03-25 23:05:31,806][__main__][INFO] - agents played in iteration 587 are Bob, Alice [2026-03-25 23:05:32,315][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:05:32,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:05:32,377][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:05:32,378][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:05:33,259][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:05:33,866][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:05:34,527][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:05:35,186][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:05:35,846][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:05:36,507][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:05:37,167][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:05:37,827][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:05:38,487][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:05:39,146][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:05:39,806][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:05:40,467][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:05:41,127][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:05:41,787][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:05:42,447][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:05:43,107][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:05:43,765][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:05:44,425][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:05:45,085][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:05:45,744][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:05:46,403][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:05:47,062][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:05:47,722][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:05:48,380][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:05:49,038][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:05:49,698][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:05:50,357][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:05:51,016][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:05:51,676][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:05:52,336][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:05:52,995][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:05:53,655][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:05:54,314][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:05:54,974][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:05:55,634][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:05:56,292][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:05:56,951][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:05:57,610][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:05:58,270][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:05:58,935][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:05:59,594][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:06:00,253][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:06:00,913][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:06:01,574][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:06:02,235][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:06:02,894][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:06:03,553][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:06:04,212][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:06:05,236][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:06:05,896][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:06:06,554][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:06:07,214][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:06:07,873][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:06:08,531][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:06:09,190][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:06:09,849][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:06:10,508][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:06:11,171][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:06:11,831][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:06:12,489][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:06:13,148][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:06:13,806][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:06:14,465][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:06:15,124][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:06:15,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:06:16,567][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:06:17,893][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:06:17,896][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:06:17,898][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:06:19,411][__main__][INFO] - Iteration 588 took 57s (17.42% Gen, 79.96% Train). Generation: 10s, Training: 46s. Estimated remaining time: 7h 0m 33s. Estimated total time: 16h 0m 47s. Time estimates for 10 more iterations: 9m 36s, 100 more iterations: 1h 36m 4s, 500 more iterations: 8h 0m 23s. [2026-03-25 23:06:19,414][__main__][INFO] - Starting iteration 588. [2026-03-25 23:06:19,418][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:06:19,419][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:06:24,150][__main__][INFO] - Number of regex retries in iteration 588: 0 [2026-03-25 23:06:24,151][__main__][INFO] - agents played in iteration 588 are Bob, Alice [2026-03-25 23:06:24,636][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:06:24,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:06:24,699][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:06:24,699][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:06:25,369][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:06:25,982][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:06:26,642][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:06:27,301][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:06:27,961][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:06:28,621][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:06:29,280][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:06:29,941][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:06:30,600][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:06:31,260][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:06:31,919][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:06:32,580][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:06:33,239][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:06:33,901][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:06:34,560][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:06:35,219][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:06:35,879][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:06:36,540][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:06:37,199][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:06:37,859][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:06:38,518][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:06:39,179][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:06:39,839][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:06:40,499][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:06:41,160][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:06:41,819][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:06:42,479][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:06:43,138][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:06:43,797][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:06:44,456][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:06:45,115][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:06:45,775][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:06:46,434][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:06:47,093][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:06:47,752][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:06:48,413][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:06:49,073][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:06:49,732][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:06:50,391][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:06:51,050][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:06:51,709][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:06:52,368][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:06:53,026][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:06:53,686][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:06:54,345][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:06:55,003][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:06:55,663][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:06:56,323][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:06:57,309][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:06:57,968][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:06:58,626][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:06:59,285][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:06:59,945][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:07:00,603][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:07:01,263][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:07:01,922][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:07:02,581][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:07:03,240][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:07:03,898][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:07:04,557][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:07:05,217][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:07:05,874][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:07:06,532][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:07:07,190][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:07:07,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:07:08,609][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:07:09,961][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:07:09,964][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:07:09,965][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:07:11,364][__main__][INFO] - Iteration 589 took 51s (9.11% Gen, 88.19% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 24m 41s. Estimated total time: 14h 25m 47s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 34s, 500 more iterations: 7h 12m 53s. [2026-03-25 23:07:11,366][__main__][INFO] - Starting iteration 589. [2026-03-25 23:07:11,370][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:07:11,370][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:07:16,125][__main__][INFO] - Number of regex retries in iteration 589: 0 [2026-03-25 23:07:16,127][__main__][INFO] - agents played in iteration 589 are Bob, Alice [2026-03-25 23:07:16,965][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:07:17,027][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:07:17,028][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:07:17,028][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:07:17,908][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:07:18,523][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:07:19,181][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:07:19,839][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:07:20,498][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:07:21,159][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:07:21,817][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:07:22,477][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:07:23,136][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:07:23,796][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:07:24,455][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:07:25,115][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:07:25,773][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:07:26,431][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:07:27,089][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:07:27,747][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:07:28,407][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:07:29,066][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:07:29,724][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:07:30,382][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:07:31,042][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:07:31,699][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:07:32,357][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:07:33,015][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:07:33,674][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:07:34,332][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:07:34,990][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:07:35,649][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:07:36,308][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:07:36,967][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:07:37,625][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:07:38,284][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:07:38,942][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:07:39,600][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:07:40,260][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:07:40,922][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:07:41,581][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:07:42,242][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:07:42,902][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:07:43,561][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:07:44,220][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:07:44,878][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:07:45,536][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:07:46,194][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:07:46,852][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:07:47,510][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:07:48,168][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:07:48,826][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:07:49,822][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:07:50,483][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:07:51,142][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:07:51,801][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:07:52,460][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:07:53,119][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:07:53,777][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:07:54,436][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:07:55,095][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:07:55,754][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:07:56,412][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:07:57,070][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:07:57,728][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:07:58,387][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:07:59,045][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:07:59,703][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:08:00,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:08:01,135][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:08:02,520][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:08:02,523][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:08:02,524][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:08:03,974][__main__][INFO] - Iteration 590 took 52s (9.04% Gen, 88.20% Train). Generation: 4s, Training: 46s. Estimated remaining time: 5h 34m 47s. Estimated total time: 14h 36m 46s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 40s, 500 more iterations: 7h 18m 23s. [2026-03-25 23:08:03,976][__main__][INFO] - Starting iteration 590. [2026-03-25 23:08:03,980][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:08:03,981][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:08:09,623][__main__][INFO] - Number of regex retries in iteration 590: 0 [2026-03-25 23:08:09,624][__main__][INFO] - agents played in iteration 590 are Bob, Alice [2026-03-25 23:08:10,478][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:08:10,546][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:08:10,547][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:08:10,548][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:08:11,228][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:08:11,839][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:08:12,502][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:08:13,159][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:08:13,817][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:08:14,474][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:08:15,134][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:08:15,792][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:08:16,449][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:08:17,107][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:08:17,767][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:08:18,424][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:08:19,084][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:08:19,741][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:08:20,401][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:08:21,058][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:08:21,717][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:08:22,375][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:08:23,036][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:08:23,696][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:08:24,360][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:08:25,017][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:08:25,676][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:08:26,335][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:08:26,994][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:08:27,653][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:08:28,312][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:08:28,973][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:08:29,631][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:08:30,288][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:08:30,947][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:08:31,606][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:08:32,264][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:08:32,921][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:08:33,581][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:08:34,239][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:08:34,899][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:08:35,558][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:08:36,217][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:08:36,875][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:08:37,533][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:08:38,191][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:08:38,849][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:08:39,507][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:08:40,166][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:08:40,824][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:08:41,483][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:08:42,143][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:08:43,129][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:08:43,788][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:08:44,446][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:08:45,104][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:08:45,764][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:08:46,423][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:08:47,082][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:08:47,740][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:08:48,399][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:08:49,058][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:08:49,716][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:08:50,375][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:08:51,034][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:08:51,692][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:08:52,351][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:08:53,010][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:08:53,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:08:54,469][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:08:55,811][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:08:55,814][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:08:55,815][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:08:57,185][__main__][INFO] - Iteration 591 took 53s (10.61% Gen, 86.81% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 43m 54s. Estimated total time: 14h 46m 46s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 40s, 500 more iterations: 7h 23m 23s. [2026-03-25 23:08:57,187][__main__][INFO] - Starting iteration 591. [2026-03-25 23:08:57,191][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:08:57,192][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:09:01,771][__main__][INFO] - Number of regex retries in iteration 591: 0 [2026-03-25 23:09:01,773][__main__][INFO] - agents played in iteration 591 are Bob, Alice [2026-03-25 23:09:02,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:09:02,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:09:02,391][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:09:02,392][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:09:03,270][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:09:03,885][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:09:04,545][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:09:05,205][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:09:05,863][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:09:06,522][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:09:07,181][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:09:07,840][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:09:08,497][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:09:09,157][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:09:09,815][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:09:10,475][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:09:11,132][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:09:11,790][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:09:12,449][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:09:13,107][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:09:13,766][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:09:14,426][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:09:15,085][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:09:15,743][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:09:16,401][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:09:17,058][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:09:17,715][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:09:18,374][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:09:19,032][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:09:19,690][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:09:20,348][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:09:21,007][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:09:21,666][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:09:22,324][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:09:22,983][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:09:23,642][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:09:24,302][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:09:24,960][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:09:25,619][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:09:26,277][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:09:26,935][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:09:27,593][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:09:28,251][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:09:28,909][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:09:29,567][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:09:30,225][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:09:30,884][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:09:31,543][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:09:32,202][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:09:32,862][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:09:33,522][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:09:34,181][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:09:35,169][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:09:35,828][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:09:36,489][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:09:37,148][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:09:37,807][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:09:38,467][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:09:39,130][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:09:39,790][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:09:40,448][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:09:41,105][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:09:41,765][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:09:42,425][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:09:43,082][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:09:43,739][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:09:44,397][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:09:45,056][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:09:45,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:09:46,457][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:09:47,784][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:09:47,786][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:09:47,787][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:09:49,193][__main__][INFO] - Iteration 592 took 52s (8.81% Gen, 88.48% Train). Generation: 4s, Training: 46s. Estimated remaining time: 5h 23m 0s. Estimated total time: 14h 26m 44s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 40s, 500 more iterations: 7h 13m 22s. [2026-03-25 23:09:49,196][__main__][INFO] - Starting iteration 592. [2026-03-25 23:09:49,200][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:09:49,201][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:09:55,368][__main__][INFO] - Number of regex retries in iteration 592: 0 [2026-03-25 23:09:55,369][__main__][INFO] - agents played in iteration 592 are Bob, Alice [2026-03-25 23:09:55,907][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:09:55,968][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:09:55,969][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:09:55,969][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:09:56,648][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:09:57,259][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:09:57,919][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:09:58,577][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:09:59,235][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:09:59,893][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:10:00,552][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:10:01,211][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:10:01,869][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:10:02,529][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:10:03,186][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:10:03,844][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:10:04,502][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:10:05,162][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:10:05,819][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:10:06,479][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:10:07,138][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:10:07,796][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:10:08,456][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:10:09,113][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:10:09,772][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:10:10,430][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:10:11,091][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:10:11,750][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:10:12,409][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:10:13,068][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:10:13,726][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:10:14,385][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:10:15,043][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:10:15,706][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:10:16,365][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:10:17,025][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:10:17,685][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:10:18,343][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:10:19,002][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:10:19,661][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:10:20,320][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:10:20,980][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:10:21,642][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:10:22,302][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:10:22,960][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:10:23,619][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:10:24,278][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:10:24,936][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:10:25,595][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:10:26,254][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:10:26,913][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:10:27,573][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:10:28,590][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:10:29,249][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:10:29,908][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:10:30,566][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:10:31,224][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:10:31,884][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:10:32,543][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:10:33,203][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:10:33,862][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:10:34,520][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:10:35,180][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:10:35,838][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:10:36,497][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:10:37,155][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:10:37,814][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:10:38,472][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:10:39,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:10:39,927][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:10:41,268][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:10:41,271][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:10:41,272][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:10:42,724][__main__][INFO] - Iteration 593 took 53s (11.52% Gen, 85.76% Train). Generation: 6s, Training: 45s. Estimated remaining time: 5h 47m 29s. Estimated total time: 14h 52m 6s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 12s, 500 more iterations: 7h 26m 3s. [2026-03-25 23:10:42,726][__main__][INFO] - Starting iteration 593. [2026-03-25 23:10:42,730][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:10:42,731][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:10:47,298][__main__][INFO] - Number of regex retries in iteration 593: 0 [2026-03-25 23:10:47,299][__main__][INFO] - agents played in iteration 593 are Bob, Alice [2026-03-25 23:10:47,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:10:47,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:10:47,847][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:10:47,848][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:10:48,664][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:10:49,283][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:10:49,944][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:10:50,604][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:10:51,263][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:10:51,922][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:10:52,581][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:10:53,241][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:10:53,898][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:10:54,558][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:10:55,217][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:10:55,876][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:10:56,535][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:10:57,195][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:10:57,854][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:10:58,516][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:10:59,175][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:10:59,835][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:11:00,494][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:11:01,153][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:11:01,812][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:11:02,473][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:11:03,131][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:11:03,790][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:11:04,449][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:11:05,109][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:11:05,767][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:11:06,426][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:11:07,085][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:11:07,743][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:11:08,403][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:11:09,061][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:11:09,720][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:11:10,379][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:11:11,041][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:11:11,700][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:11:12,359][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:11:13,017][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:11:13,675][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:11:14,334][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:11:14,992][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:11:15,651][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:11:16,310][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:11:16,969][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:11:17,627][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:11:18,287][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:11:18,946][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:11:19,605][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:11:20,584][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:11:21,244][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:11:21,902][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:11:22,561][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:11:23,221][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:11:23,880][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:11:24,538][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:11:25,197][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:11:25,856][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:11:26,515][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:11:27,174][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:11:27,833][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:11:28,491][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:11:29,151][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:11:29,810][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:11:30,468][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:11:31,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:11:31,916][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:11:33,279][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:11:33,282][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:11:33,283][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:11:34,763][__main__][INFO] - Iteration 594 took 52s (8.78% Gen, 88.37% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 21m 45s. Estimated total time: 14h 27m 14s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 43s, 500 more iterations: 7h 13m 37s. [2026-03-25 23:11:34,765][__main__][INFO] - Starting iteration 594. [2026-03-25 23:11:34,769][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:11:34,769][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:11:39,357][__main__][INFO] - Number of regex retries in iteration 594: 0 [2026-03-25 23:11:39,359][__main__][INFO] - agents played in iteration 594 are Bob, Alice [2026-03-25 23:11:40,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:11:40,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:11:40,245][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:11:40,245][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:11:41,117][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:11:41,730][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:11:42,391][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:11:43,051][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:11:43,712][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:11:44,372][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:11:45,032][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:11:45,692][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:11:46,352][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:11:47,012][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:11:47,672][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:11:48,333][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:11:48,992][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:11:49,651][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:11:50,311][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:11:50,970][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:11:51,629][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:11:52,288][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:11:52,949][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:11:53,608][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:11:54,266][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:11:54,926][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:11:55,585][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:11:56,244][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:11:56,903][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:11:57,562][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:11:58,222][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:11:58,882][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:11:59,542][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:12:00,201][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:12:00,860][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:12:01,522][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:12:02,179][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:12:02,838][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:12:03,498][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:12:04,157][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:12:04,816][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:12:05,476][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:12:06,135][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:12:06,794][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:12:07,455][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:12:08,114][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:12:08,773][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:12:09,433][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:12:10,093][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:12:10,753][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:12:11,412][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:12:12,072][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:12:13,066][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:12:13,726][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:12:14,384][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:12:15,044][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:12:15,702][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:12:16,361][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:12:17,020][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:12:17,679][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:12:18,337][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:12:18,997][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:12:19,659][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:12:20,318][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:12:20,978][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:12:21,636][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:12:22,296][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:12:22,955][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:12:23,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:12:24,409][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:12:25,766][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:12:25,768][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:12:25,770][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:12:27,188][__main__][INFO] - Iteration 595 took 52s (8.76% Gen, 88.53% Train). Generation: 4s, Training: 46s. Estimated remaining time: 5h 27m 19s. Estimated total time: 14h 33m 40s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 22s, 500 more iterations: 7h 16m 50s. [2026-03-25 23:12:27,190][__main__][INFO] - Starting iteration 595. [2026-03-25 23:12:27,194][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:12:27,194][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:12:35,410][__main__][INFO] - Number of regex retries in iteration 595: 0 [2026-03-25 23:12:35,411][__main__][INFO] - agents played in iteration 595 are Bob, Alice [2026-03-25 23:12:36,015][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:12:36,082][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:12:36,083][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:12:36,083][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:12:36,799][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:12:37,410][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:12:38,070][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:12:38,730][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:12:39,390][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:12:40,049][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:12:40,709][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:12:41,368][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:12:42,026][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:12:42,685][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:12:43,346][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:12:44,005][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:12:44,663][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:12:45,322][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:12:45,981][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:12:46,639][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:12:47,298][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:12:47,957][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:12:48,615][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:12:49,273][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:12:49,932][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:12:50,590][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:12:51,249][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:12:51,907][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:12:52,565][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:12:53,224][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:12:53,882][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:12:54,541][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:12:55,200][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:12:55,858][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:12:56,517][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:12:57,175][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:12:57,834][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:12:58,493][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:12:59,152][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:12:59,810][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:13:00,470][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:13:01,133][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:13:01,794][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:13:02,451][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:13:03,111][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:13:03,769][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:13:04,427][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:13:05,085][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:13:05,744][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:13:06,402][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:13:07,061][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:13:07,721][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:13:08,716][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:13:09,377][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:13:10,037][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:13:10,696][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:13:11,355][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:13:12,014][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:13:12,673][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:13:13,333][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:13:13,993][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:13:14,652][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:13:15,314][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:13:15,973][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:13:16,631][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:13:17,290][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:13:17,949][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:13:18,608][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:13:19,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:13:20,078][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:13:21,409][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:13:21,412][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:13:21,413][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:13:22,771][__main__][INFO] - Iteration 596 took 55s (14.78% Gen, 82.77% Train). Generation: 8s, Training: 46s. Estimated remaining time: 6h 19m 1s. Estimated total time: 15h 26m 18s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 37s, 500 more iterations: 7h 43m 9s. [2026-03-25 23:13:22,773][__main__][INFO] - Starting iteration 596. [2026-03-25 23:13:22,776][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:13:22,776][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:13:27,788][__main__][INFO] - Number of regex retries in iteration 596: 0 [2026-03-25 23:13:27,789][__main__][INFO] - agents played in iteration 596 are Bob, Alice [2026-03-25 23:13:28,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:13:28,353][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:13:28,354][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:13:28,354][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:13:29,250][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:13:29,864][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:13:30,527][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:13:31,187][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:13:31,845][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:13:32,506][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:13:33,163][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:13:33,822][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:13:34,481][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:13:35,139][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:13:35,797][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:13:36,455][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:13:37,113][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:13:37,770][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:13:38,429][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:13:39,088][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:13:39,746][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:13:40,407][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:13:41,069][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:13:41,728][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:13:42,387][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:13:43,044][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:13:43,702][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:13:44,360][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:13:45,019][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:13:45,678][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:13:46,337][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:13:46,996][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:13:47,655][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:13:48,313][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:13:48,972][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:13:49,630][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:13:50,288][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:13:50,947][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:13:51,606][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:13:52,265][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:13:52,924][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:13:53,583][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:13:54,242][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:13:54,901][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:13:55,561][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:13:56,219][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:13:56,878][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:13:57,537][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:13:58,195][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:13:58,856][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:13:59,514][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:14:00,175][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:14:01,172][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:14:01,831][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:14:02,493][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:14:03,151][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:14:03,808][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:14:04,467][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:14:05,126][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:14:05,784][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:14:06,442][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:14:07,102][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:14:07,761][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:14:08,420][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:14:09,078][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:14:09,738][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:14:10,396][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:14:11,056][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:14:11,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:14:12,489][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:14:14,104][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:14:14,106][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:14:14,108][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:14:15,887][__main__][INFO] - Iteration 597 took 53s (9.44% Gen, 87.21% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 37m 2s. Estimated total time: 14h 45m 12s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 31s, 500 more iterations: 7h 22m 36s. [2026-03-25 23:14:15,889][__main__][INFO] - Starting iteration 597. [2026-03-25 23:14:15,893][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:14:15,893][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:14:21,283][__main__][INFO] - Number of regex retries in iteration 597: 0 [2026-03-25 23:14:21,284][__main__][INFO] - agents played in iteration 597 are Bob, Alice [2026-03-25 23:14:21,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:14:21,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:14:21,833][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:14:21,834][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:14:22,720][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:14:23,328][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:14:23,989][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:14:24,648][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:14:25,308][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:14:25,968][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:14:26,628][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:14:27,288][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:14:27,949][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:14:28,609][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:14:29,268][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:14:29,926][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:14:30,586][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:14:31,246][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:14:31,906][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:14:32,566][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:14:33,227][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:14:33,885][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:14:34,545][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:14:35,204][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:14:35,864][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:14:36,524][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:14:37,184][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:14:37,844][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:14:38,505][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:14:39,167][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:14:39,828][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:14:40,487][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:14:41,146][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:14:41,807][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:14:42,466][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:14:43,129][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:14:43,790][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:14:44,449][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:14:45,110][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:14:45,769][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:14:46,429][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:14:47,088][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:14:47,748][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:14:48,407][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:14:49,067][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:14:49,728][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:14:50,387][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:14:51,050][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:14:51,711][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:14:52,370][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:14:53,030][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:14:53,690][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:14:54,687][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:14:55,348][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:14:56,007][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:14:56,667][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:14:57,327][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:14:57,986][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:14:58,645][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:14:59,304][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:14:59,964][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:15:00,623][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:15:01,283][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:15:01,942][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:15:02,602][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:15:03,261][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:15:03,919][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:15:04,578][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:15:05,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:15:06,020][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:15:07,355][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:15:07,358][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:15:07,359][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:15:08,789][__main__][INFO] - Iteration 598 took 52s (10.19% Gen, 87.10% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 32m 34s. Estimated total time: 14h 41m 37s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 9s, 500 more iterations: 7h 20m 48s. [2026-03-25 23:15:08,791][__main__][INFO] - Starting iteration 598. [2026-03-25 23:15:08,794][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:15:08,795][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:15:13,657][__main__][INFO] - Number of regex retries in iteration 598: 0 [2026-03-25 23:15:13,659][__main__][INFO] - agents played in iteration 598 are Bob, Alice [2026-03-25 23:15:14,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:15:14,207][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:15:14,207][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:15:14,208][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:15:14,913][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:15:15,535][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:15:16,195][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:15:16,853][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:15:17,512][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:15:18,171][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:15:18,830][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:15:19,490][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:15:20,149][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:15:20,807][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:15:21,467][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:15:22,125][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:15:22,786][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:15:23,449][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:15:24,107][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:15:24,766][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:15:25,425][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:15:26,084][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:15:26,743][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:15:27,401][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:15:28,059][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:15:28,717][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:15:29,375][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:15:30,034][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:15:30,692][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:15:31,351][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:15:32,010][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:15:32,669][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:15:33,327][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:15:33,986][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:15:34,644][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:15:35,302][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:15:35,961][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:15:36,619][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:15:37,277][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:15:37,935][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:15:38,598][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:15:39,257][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:15:39,921][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:15:40,577][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:15:41,236][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:15:41,897][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:15:42,555][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:15:43,215][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:15:43,873][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:15:44,531][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:15:45,190][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:15:45,848][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:15:46,828][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:15:47,491][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:15:48,150][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:15:48,809][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:15:49,469][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:15:50,128][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:15:50,788][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:15:51,447][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:15:52,106][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:15:52,763][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:15:53,424][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:15:54,082][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:15:54,740][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:15:55,398][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:15:56,057][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:15:56,714][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:15:57,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:15:58,127][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:15:59,482][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:15:59,485][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:15:59,486][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:16:00,913][__main__][INFO] - Iteration 599 took 52s (9.33% Gen, 87.93% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 18m 44s. Estimated total time: 14h 28m 40s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 52s, 500 more iterations: 7h 14m 20s. [2026-03-25 23:16:00,915][__main__][INFO] - Starting iteration 599. [2026-03-25 23:16:00,919][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:16:00,920][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:16:07,362][__main__][INFO] - Number of regex retries in iteration 599: 0 [2026-03-25 23:16:07,363][__main__][INFO] - agents played in iteration 599 are Bob, Alice [2026-03-25 23:16:08,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:16:08,305][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:16:08,306][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:16:08,306][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:16:09,070][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:16:09,683][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:16:10,344][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:16:11,001][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:16:11,661][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:16:12,319][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:16:12,978][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:16:13,638][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:16:14,297][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:16:14,955][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:16:15,614][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:16:16,271][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:16:16,928][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:16:17,586][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:16:18,244][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:16:18,905][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:16:19,562][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:16:20,221][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:16:20,880][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:16:21,538][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:16:22,198][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:16:22,856][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:16:23,514][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:16:24,172][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:16:24,830][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:16:25,489][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:16:26,149][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:16:26,807][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:16:27,465][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:16:28,122][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:16:28,780][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:16:29,439][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:16:30,097][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:16:30,756][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:16:31,415][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:16:32,075][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:16:32,733][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:16:33,392][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:16:34,049][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:16:34,707][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:16:35,365][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:16:36,023][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:16:36,681][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:16:37,340][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:16:37,998][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:16:38,656][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:16:39,314][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:16:39,973][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:16:40,966][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:16:41,625][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:16:42,283][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:16:42,942][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:16:43,600][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:16:44,258][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:16:44,918][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:16:45,576][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:16:46,236][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:16:46,894][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:16:47,552][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:16:48,210][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:16:48,870][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:16:49,528][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:16:50,186][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:16:50,846][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:16:51,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:16:52,291][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:16:53,605][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:16:53,607][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:16:53,609][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:16:55,021][__main__][INFO] - Iteration 600 took 54s (11.91% Gen, 85.47% Train). Generation: 6s, Training: 46s. Estimated remaining time: 5h 50m 54s. Estimated total time: 15h 1m 43s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 10s, 500 more iterations: 7h 30m 51s. [2026-03-25 23:16:55,023][__main__][INFO] - Starting iteration 600. [2026-03-25 23:16:55,028][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:16:55,029][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:17:00,851][__main__][INFO] - Number of regex retries in iteration 600: 0 [2026-03-25 23:17:00,852][__main__][INFO] - agents played in iteration 600 are Bob, Alice [2026-03-25 23:17:01,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:17:01,419][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:17:01,420][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:17:01,420][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:17:02,131][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:17:02,747][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:17:03,407][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:17:04,065][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:17:04,723][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:17:05,384][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:17:06,042][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:17:06,699][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:17:07,357][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:17:08,015][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:17:08,673][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:17:09,333][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:17:09,990][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:17:10,649][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:17:11,307][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:17:11,964][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:17:12,622][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:17:13,280][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:17:13,939][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:17:14,597][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:17:15,255][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:17:15,913][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:17:16,570][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:17:17,228][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:17:17,886][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:17:18,544][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:17:19,202][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:17:19,861][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:17:20,520][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:17:21,177][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:17:21,834][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:17:22,492][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:17:23,150][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:17:23,807][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:17:24,465][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:17:25,122][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:17:25,780][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:17:26,438][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:17:27,096][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:17:27,755][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:17:28,413][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:17:29,071][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:17:29,730][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:17:30,388][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:17:31,046][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:17:31,704][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:17:32,362][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:17:33,020][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:17:34,007][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:17:34,667][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:17:35,326][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:17:35,984][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:17:36,642][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:17:37,302][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:17:37,961][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:17:38,620][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:17:39,280][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:17:39,939][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:17:40,596][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:17:41,259][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:17:41,916][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:17:42,575][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:17:43,233][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:17:43,892][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:17:44,550][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:17:45,308][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:17:46,647][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:17:46,650][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:17:46,651][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:17:49,508][__main__][INFO] - Iteration 601 took 54s (10.69% Gen, 84.06% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 56m 17s. Estimated total time: 15h 8m 1s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 48s, 500 more iterations: 7h 34m 0s. [2026-03-25 23:17:49,510][__main__][INFO] - Starting iteration 601. [2026-03-25 23:17:49,514][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:17:49,515][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:17:56,113][__main__][INFO] - Number of regex retries in iteration 601: 0 [2026-03-25 23:17:56,114][__main__][INFO] - agents played in iteration 601 are Bob, Alice [2026-03-25 23:17:56,675][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:17:56,738][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:17:56,738][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:17:56,739][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:17:57,391][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:17:57,997][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:17:58,657][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:17:59,316][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:17:59,974][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:18:00,632][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:18:01,291][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:18:01,949][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:18:02,607][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:18:03,265][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:18:03,923][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:18:04,581][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:18:05,238][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:18:05,896][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:18:06,554][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:18:07,213][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:18:07,871][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:18:08,530][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:18:09,189][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:18:09,846][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:18:10,503][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:18:11,160][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:18:11,818][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:18:12,476][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:18:13,137][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:18:13,797][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:18:14,455][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:18:15,115][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:18:15,773][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:18:16,431][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:18:17,090][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:18:17,748][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:18:18,405][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:18:19,063][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:18:19,721][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:18:20,379][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:18:21,037][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:18:21,695][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:18:22,355][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:18:23,014][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:18:23,672][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:18:24,330][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:18:24,989][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:18:25,647][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:18:26,305][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:18:26,964][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:18:27,621][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:18:28,280][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:18:29,266][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:18:29,925][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:18:30,586][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:18:31,243][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:18:31,903][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:18:32,560][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:18:33,218][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:18:33,877][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:18:34,537][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:18:35,196][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:18:35,855][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:18:36,513][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:18:37,171][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:18:37,829][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:18:38,487][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:18:39,145][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:18:39,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:18:40,536][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:18:41,856][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:18:41,859][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:18:41,860][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:18:43,190][__main__][INFO] - Iteration 602 took 53s (12.29% Gen, 85.22% Train). Generation: 6s, Training: 45s. Estimated remaining time: 5h 41m 59s. Estimated total time: 14h 54m 37s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 27s, 500 more iterations: 7h 27m 18s. [2026-03-25 23:18:43,191][__main__][INFO] - Starting iteration 602. [2026-03-25 23:18:43,195][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:18:43,195][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:18:50,919][__main__][INFO] - Number of regex retries in iteration 602: 0 [2026-03-25 23:18:50,920][__main__][INFO] - agents played in iteration 602 are Bob, Alice [2026-03-25 23:18:51,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:18:51,512][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:18:51,512][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:18:51,513][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:18:52,286][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:18:52,896][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:18:53,558][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:18:54,217][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:18:54,876][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:18:55,535][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:18:56,194][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:18:56,853][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:18:57,514][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:18:58,173][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:18:58,834][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:18:59,493][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:19:00,152][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:19:00,812][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:19:01,473][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:19:02,134][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:19:02,794][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:19:03,454][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:19:04,114][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:19:04,775][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:19:05,434][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:19:06,093][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:19:06,753][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:19:07,412][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:19:08,071][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:19:08,730][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:19:09,390][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:19:10,049][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:19:10,708][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:19:11,367][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:19:12,026][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:19:12,685][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:19:13,344][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:19:14,003][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:19:14,662][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:19:15,321][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:19:15,981][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:19:16,638][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:19:17,296][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:19:17,955][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:19:18,614][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:19:19,273][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:19:19,932][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:19:20,591][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:19:21,251][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:19:21,909][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:19:22,569][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:19:23,228][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:19:24,222][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:19:24,883][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:19:25,542][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:19:26,202][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:19:26,861][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:19:27,521][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:19:28,182][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:19:28,839][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:19:29,497][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:19:30,155][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:19:30,814][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:19:31,471][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:19:32,130][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:19:32,787][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:19:33,447][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:19:34,105][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:19:34,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:19:35,578][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:19:36,911][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:19:36,914][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:19:36,915][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:19:38,279][__main__][INFO] - Iteration 603 took 55s (14.02% Gen, 83.50% Train). Generation: 7s, Training: 45s. Estimated remaining time: 6h 4m 32s. Estimated total time: 15h 18m 5s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 48s, 500 more iterations: 7h 39m 2s. [2026-03-25 23:19:38,281][__main__][INFO] - Starting iteration 603. [2026-03-25 23:19:38,285][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:19:38,285][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:19:43,832][__main__][INFO] - Number of regex retries in iteration 603: 0 [2026-03-25 23:19:43,833][__main__][INFO] - agents played in iteration 603 are Bob, Alice [2026-03-25 23:19:44,737][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:19:44,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:19:44,799][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:19:44,799][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:19:45,625][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:19:46,238][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:19:46,904][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:19:47,563][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:19:48,224][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:19:48,883][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:19:49,541][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:19:50,200][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:19:50,858][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:19:51,518][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:19:52,177][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:19:52,835][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:19:53,492][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:19:54,150][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:19:54,809][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:19:55,466][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:19:56,124][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:19:56,782][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:19:57,439][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:19:58,097][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:19:58,755][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:19:59,413][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:20:00,071][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:20:00,728][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:20:01,390][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:20:02,049][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:20:02,707][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:20:03,364][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:20:04,022][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:20:04,680][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:20:05,338][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:20:05,996][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:20:06,654][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:20:07,312][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:20:07,969][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:20:08,627][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:20:09,286][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:20:09,944][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:20:10,602][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:20:11,259][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:20:11,917][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:20:12,575][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:20:13,234][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:20:13,892][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:20:14,550][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:20:15,207][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:20:15,865][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:20:16,523][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:20:17,522][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:20:18,180][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:20:18,838][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:20:19,496][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:20:20,154][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:20:20,811][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:20:21,469][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:20:22,127][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:20:22,785][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:20:23,443][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:20:24,100][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:20:24,758][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:20:25,416][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:20:26,073][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:20:26,731][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:20:27,389][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:20:28,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:20:28,793][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:20:30,118][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:20:30,121][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:20:30,122][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:20:31,571][__main__][INFO] - Iteration 604 took 53s (10.41% Gen, 86.87% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 33m 41s. Estimated total time: 14h 48m 8s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 48s, 500 more iterations: 7h 24m 4s. [2026-03-25 23:20:31,574][__main__][INFO] - Starting iteration 604. [2026-03-25 23:20:31,579][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:20:31,579][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:20:38,913][__main__][INFO] - Number of regex retries in iteration 604: 0 [2026-03-25 23:20:38,914][__main__][INFO] - agents played in iteration 604 are Bob, Alice [2026-03-25 23:20:39,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:20:39,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:20:39,467][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:20:39,468][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:20:40,327][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:20:40,937][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:20:41,596][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:20:42,254][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:20:42,912][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:20:43,571][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:20:44,228][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:20:44,888][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:20:45,546][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:20:46,204][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:20:46,862][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:20:47,520][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:20:48,180][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:20:48,839][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:20:49,497][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:20:50,155][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:20:50,814][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:20:51,472][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:20:52,130][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:20:52,788][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:20:53,445][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:20:54,102][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:20:54,760][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:20:55,417][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:20:56,075][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:20:56,732][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:20:57,390][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:20:58,049][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:20:58,709][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:20:59,367][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:21:00,025][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:21:00,683][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:21:01,340][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:21:01,998][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:21:02,656][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:21:03,313][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:21:03,971][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:21:04,629][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:21:05,286][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:21:05,944][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:21:06,602][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:21:07,260][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:21:07,918][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:21:08,576][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:21:09,236][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:21:09,897][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:21:10,556][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:21:11,214][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:21:12,214][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:21:12,874][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:21:13,534][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:21:14,191][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:21:14,849][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:21:15,508][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:21:16,167][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:21:16,825][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:21:17,483][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:21:18,142][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:21:18,802][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:21:19,461][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:21:20,120][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:21:20,781][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:21:21,437][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:21:22,094][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:21:22,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:21:23,510][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:21:24,861][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:21:24,864][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:21:24,865][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:21:26,289][__main__][INFO] - Iteration 605 took 54s (13.40% Gen, 83.99% Train). Generation: 7s, Training: 45s. Estimated remaining time: 5h 56m 31s. Estimated total time: 15h 11m 52s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 11s, 500 more iterations: 7h 35m 56s. [2026-03-25 23:21:26,291][__main__][INFO] - Starting iteration 605. [2026-03-25 23:21:26,295][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:21:26,296][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:21:31,546][__main__][INFO] - Number of regex retries in iteration 605: 0 [2026-03-25 23:21:31,547][__main__][INFO] - agents played in iteration 605 are Bob, Alice [2026-03-25 23:21:32,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:21:32,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:21:32,175][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:21:32,175][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:21:32,894][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:21:33,530][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:21:34,189][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:21:34,846][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:21:35,504][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:21:36,160][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:21:36,818][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:21:37,475][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:21:38,135][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:21:38,792][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:21:39,449][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:21:40,106][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:21:40,764][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:21:41,426][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:21:42,085][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:21:42,742][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:21:43,402][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:21:44,061][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:21:44,720][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:21:45,380][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:21:46,036][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:21:46,695][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:21:47,357][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:21:48,675][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:21:51,876][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:21:52,534][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:21:53,192][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:21:53,852][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:21:54,511][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:21:55,170][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:21:55,904][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:21:56,589][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:21:57,249][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:21:57,907][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:21:58,568][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:21:59,228][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:21:59,887][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:22:00,546][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:22:01,204][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:22:01,864][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:22:02,523][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:22:03,182][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:22:03,841][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:22:04,500][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:22:05,159][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:22:05,818][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:22:06,476][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:22:07,135][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:22:08,168][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:22:08,828][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:22:09,486][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:22:10,144][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:22:10,802][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:22:11,460][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:22:12,118][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:22:12,775][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:22:13,435][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:22:14,093][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:22:14,750][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:22:15,408][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:22:16,066][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:22:16,723][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:22:17,381][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:22:18,038][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:22:18,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:22:19,484][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:46 [2026-03-25 23:22:20,832][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:22:20,834][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:22:20,836][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:22:22,324][__main__][INFO] - Iteration 606 took 56s (9.37% Gen, 87.97% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 17m 33s. Estimated total time: 15h 33m 50s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 23s, 500 more iterations: 7h 46m 55s. [2026-03-25 23:22:22,327][__main__][INFO] - Starting iteration 606. [2026-03-25 23:22:22,363][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:22:22,363][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:22:28,690][__main__][INFO] - Number of regex retries in iteration 606: 0 [2026-03-25 23:22:28,691][__main__][INFO] - agents played in iteration 606 are Bob, Alice [2026-03-25 23:22:29,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:22:29,318][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:22:29,318][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:22:29,319][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:22:30,025][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:22:30,633][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:22:31,293][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:22:31,950][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:22:32,608][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:22:33,266][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:22:33,923][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:22:34,580][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:22:35,238][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:22:35,895][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:22:36,552][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:22:37,210][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:22:37,867][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:22:38,525][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:22:39,183][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:22:39,840][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:22:40,498][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:22:41,155][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:22:41,813][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:22:42,470][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:22:43,128][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:22:43,787][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:22:44,446][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:22:45,104][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:22:45,764][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:22:46,422][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:22:47,080][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:22:47,738][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:22:48,395][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:22:49,053][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:22:49,710][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:22:50,368][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:22:51,025][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:22:51,682][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:22:52,340][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:22:52,998][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:22:53,655][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:22:54,313][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:22:54,970][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:22:55,628][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:22:56,286][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:22:56,945][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:22:57,603][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:22:58,261][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:22:58,920][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:22:59,578][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:23:00,236][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:23:00,893][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:23:01,871][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:23:02,531][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:23:03,189][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:23:03,847][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:23:04,506][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:23:05,166][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:23:05,823][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:23:06,482][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:23:07,141][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:23:07,798][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:23:08,458][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:23:09,117][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:23:09,775][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:23:10,432][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:23:11,092][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:23:11,751][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:23:12,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:23:13,091][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:23:14,444][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:23:14,446][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:23:14,448][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:23:15,887][__main__][INFO] - Iteration 607 took 53s (11.82% Gen, 85.48% Train). Generation: 6s, Training: 45s. Estimated remaining time: 5h 34m 55s. Estimated total time: 14h 52m 6s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 12s, 500 more iterations: 7h 26m 3s. [2026-03-25 23:23:15,889][__main__][INFO] - Starting iteration 607. [2026-03-25 23:23:15,893][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:23:15,893][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:23:21,526][__main__][INFO] - Number of regex retries in iteration 607: 0 [2026-03-25 23:23:21,527][__main__][INFO] - agents played in iteration 607 are Bob, Alice [2026-03-25 23:23:22,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:23:22,495][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:23:22,496][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:23:22,496][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:23:23,314][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:23:23,921][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:23:24,582][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:23:25,239][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:23:25,897][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:23:26,555][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:23:27,213][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:23:27,870][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:23:28,529][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:23:29,187][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:23:29,846][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:23:30,503][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:23:31,161][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:23:31,820][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:23:32,479][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:23:33,138][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:23:33,795][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:23:34,453][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:23:35,111][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:23:35,769][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:23:36,427][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:23:37,085][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:23:37,742][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:23:38,399][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:23:39,057][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:23:39,714][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:23:40,372][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:23:41,029][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:23:41,686][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:23:42,344][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:23:43,002][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:23:43,660][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:23:44,318][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:23:44,975][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:23:45,632][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:23:46,290][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:23:46,947][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:23:47,604][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:23:48,262][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:23:48,920][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:23:49,579][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:23:50,238][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:23:50,896][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:23:51,553][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:23:52,210][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:23:52,868][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:23:53,525][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:23:54,183][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:23:55,163][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:23:55,824][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:23:56,481][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:23:57,141][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:23:57,798][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:23:58,457][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:23:59,116][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:23:59,775][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:24:00,434][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:24:01,091][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:24:01,750][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:24:02,407][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:24:03,065][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:24:03,723][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:24:04,382][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:24:05,040][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:24:05,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:24:06,474][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:24:07,792][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:24:07,795][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:24:07,796][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:24:09,282][__main__][INFO] - Iteration 608 took 53s (10.55% Gen, 86.66% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 31m 46s. Estimated total time: 14h 49m 50s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 59s, 500 more iterations: 7h 24m 55s. [2026-03-25 23:24:09,283][__main__][INFO] - Starting iteration 608. [2026-03-25 23:24:09,288][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:24:09,289][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:24:22,094][__main__][INFO] - Number of regex retries in iteration 608: 0 [2026-03-25 23:24:22,095][__main__][INFO] - agents played in iteration 608 are Bob, Alice [2026-03-25 23:24:22,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:24:22,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:24:22,763][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:24:22,763][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:24:23,591][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:24:24,207][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:24:24,867][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:24:25,527][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:24:26,186][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:24:26,844][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:24:27,503][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:24:28,163][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:24:28,824][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:24:29,482][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:24:30,141][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:24:30,799][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:24:31,457][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:24:32,116][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:24:32,775][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:24:33,433][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:24:34,092][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:24:34,750][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:24:35,410][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:24:36,069][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:24:36,727][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:24:37,386][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:24:38,045][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:24:38,703][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:24:39,362][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:24:40,022][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:24:40,681][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:24:41,345][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:24:42,003][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:24:42,663][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:24:43,319][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:24:43,977][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:24:44,636][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:24:45,294][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:24:45,952][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:24:46,613][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:24:47,273][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:24:47,932][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:24:48,591][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:24:49,249][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:24:49,910][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:24:50,570][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:24:51,232][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:24:51,890][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:24:52,550][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:24:53,209][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:24:53,869][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:24:54,527][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:24:55,522][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:24:56,181][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:24:56,840][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:24:57,498][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:24:58,158][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:24:58,817][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:24:59,475][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:25:00,133][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:25:00,792][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:25:01,451][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:25:02,110][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:25:02,767][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:25:03,426][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:25:04,085][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:25:04,743][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:25:05,402][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:25:06,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:25:06,861][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:25:08,223][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:25:08,226][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:25:08,227][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:25:09,694][__main__][INFO] - Iteration 609 took 1m 0s (21.20% Gen, 76.37% Train). Generation: 12s, Training: 46s. Estimated remaining time: 7h 27m 43s. Estimated total time: 16h 46m 48s. Time estimates for 10 more iterations: 10m 4s, 100 more iterations: 1h 40m 40s, 500 more iterations: 8h 23m 24s. [2026-03-25 23:25:09,697][__main__][INFO] - Starting iteration 609. [2026-03-25 23:25:09,702][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:25:09,702][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:25:16,530][__main__][INFO] - Number of regex retries in iteration 609: 0 [2026-03-25 23:25:16,531][__main__][INFO] - agents played in iteration 609 are Bob, Alice [2026-03-25 23:25:17,041][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:25:17,102][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:25:17,103][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:25:17,104][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:25:17,776][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:25:18,388][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:25:19,047][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:25:19,705][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:25:20,362][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:25:21,021][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:25:21,679][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:25:22,336][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:25:22,995][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:25:23,653][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:25:24,312][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:25:24,970][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:25:25,627][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:25:26,286][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:25:26,945][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:25:27,605][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:25:28,263][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:25:28,921][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:25:29,579][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:25:30,236][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:25:30,895][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:25:31,554][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:25:32,212][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:25:32,869][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:25:33,527][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:25:34,186][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:25:34,845][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:25:35,503][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:25:36,161][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:25:36,818][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:25:37,476][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:25:38,134][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:25:38,792][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:25:39,450][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:25:40,108][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:25:40,767][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:25:41,425][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:25:42,083][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:25:42,743][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:25:43,402][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:25:44,060][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:25:44,719][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:25:45,377][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:25:46,035][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:25:46,693][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:25:47,351][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:25:48,009][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:25:48,669][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:25:49,659][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:25:50,318][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:25:50,978][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:25:51,637][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:25:52,295][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:25:52,954][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:25:53,613][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:25:54,271][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:25:54,929][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:25:55,587][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:25:56,246][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:25:56,905][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:25:57,563][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:25:58,222][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:25:58,880][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:25:59,538][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:26:00,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:26:00,967][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:26:02,290][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:26:02,294][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:26:02,295][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:26:03,811][__main__][INFO] - Iteration 610 took 54s (12.62% Gen, 84.57% Train). Generation: 6s, Training: 45s. Estimated remaining time: 5h 41m 53s. Estimated total time: 15h 1m 52s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 11s, 500 more iterations: 7h 30m 56s. [2026-03-25 23:26:03,813][__main__][INFO] - Starting iteration 610. [2026-03-25 23:26:03,817][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:26:03,818][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:26:10,046][__main__][INFO] - Number of regex retries in iteration 610: 0 [2026-03-25 23:26:12,213][__main__][INFO] - agents played in iteration 610 are Bob, Alice [2026-03-25 23:26:12,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:26:12,759][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:26:12,760][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:26:12,760][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:26:13,606][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:26:14,216][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:26:14,876][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:26:15,535][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:26:16,194][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:26:16,853][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:26:17,511][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:26:18,169][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:26:18,830][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:26:19,490][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:26:20,154][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:26:20,810][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:26:21,471][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:26:22,129][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:26:22,787][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:26:23,445][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:26:24,103][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:26:24,761][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:26:25,420][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:26:26,078][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:26:26,736][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:26:27,395][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:26:28,052][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:26:28,710][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:26:29,368][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:26:30,026][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:26:30,683][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:26:31,341][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:26:31,999][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:26:32,657][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:26:33,315][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:26:33,973][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:26:34,630][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:26:35,288][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:26:35,946][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:26:36,603][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:26:37,261][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:26:37,920][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:26:38,578][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:26:39,236][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:26:39,894][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:26:40,552][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:26:41,210][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:26:41,868][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:26:42,526][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:26:43,183][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:26:43,841][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:26:44,499][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:26:45,484][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:26:46,150][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:26:46,808][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:26:47,466][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:26:48,125][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:26:48,783][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:26:49,441][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:26:50,100][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:26:50,759][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:26:51,417][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:26:52,075][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:26:52,734][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:26:53,391][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:26:54,050][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:26:54,708][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:26:55,367][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:26:56,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:26:56,847][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:26:58,176][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:26:59,033][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:26:59,034][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:27:00,733][__main__][INFO] - Iteration 611 took 56s (14.75% Gen, 82.26% Train). Generation: 8s, Training: 46s. Estimated remaining time: 6h 27m 42s. Estimated total time: 15h 48m 37s. Time estimates for 10 more iterations: 9m 29s, 100 more iterations: 1h 34m 51s, 500 more iterations: 7h 54m 18s. [2026-03-25 23:27:00,735][__main__][INFO] - Starting iteration 611. [2026-03-25 23:27:00,740][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:27:00,741][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:27:06,478][__main__][INFO] - Number of regex retries in iteration 611: 0 [2026-03-25 23:27:06,479][__main__][INFO] - agents played in iteration 611 are Bob, Alice [2026-03-25 23:27:07,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:27:07,461][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:27:07,462][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:27:07,462][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:27:08,131][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:27:08,745][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:27:09,406][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:27:10,064][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:27:10,723][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:27:11,383][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:27:12,042][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:27:12,700][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:27:13,360][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:27:14,017][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:27:14,678][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:27:15,335][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:27:15,993][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:27:16,651][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:27:17,309][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:27:17,967][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:27:18,624][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:27:19,285][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:27:19,943][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:27:20,602][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:27:21,259][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:27:21,916][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:27:22,575][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:27:23,234][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:27:23,891][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:27:24,550][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:27:25,209][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:27:25,867][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:27:26,524][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:27:27,182][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:27:27,840][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:27:28,498][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:27:29,156][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:27:29,813][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:27:30,471][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:27:31,129][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:27:31,787][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:27:32,445][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:27:33,103][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:27:33,761][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:27:34,419][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:27:35,077][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:27:35,735][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:27:36,393][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:27:37,051][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:27:37,709][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:27:38,366][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:27:39,023][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:27:40,006][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:27:40,667][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:27:41,325][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:27:41,984][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:27:42,642][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:27:43,301][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:27:43,960][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:27:44,617][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:27:45,275][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:27:45,934][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:27:46,592][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:27:47,251][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:27:47,909][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:27:48,566][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:27:49,225][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:27:49,884][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:27:50,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:27:51,324][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:27:52,662][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:27:52,664][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:27:52,666][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:27:54,011][__main__][INFO] - Iteration 612 took 53s (10.77% Gen, 86.70% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 26m 5s. Estimated total time: 14h 47m 53s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 47s, 500 more iterations: 7h 23m 56s. [2026-03-25 23:27:54,013][__main__][INFO] - Starting iteration 612. [2026-03-25 23:27:54,018][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:27:54,018][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:28:05,423][__main__][INFO] - Number of regex retries in iteration 612: 0 [2026-03-25 23:28:05,424][__main__][INFO] - agents played in iteration 612 are Bob, Alice [2026-03-25 23:28:06,073][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:28:06,134][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:28:06,135][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:28:06,136][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:28:06,788][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:28:07,398][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:28:08,058][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:28:08,715][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:28:09,372][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:28:10,030][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:28:10,688][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:28:11,344][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:28:12,001][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:28:12,658][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:28:13,315][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:28:13,971][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:28:14,628][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:28:15,286][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:28:15,946][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:28:16,603][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:28:17,260][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:28:17,917][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:28:18,577][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:28:19,234][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:28:19,891][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:28:20,548][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:28:21,205][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:28:21,864][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:28:22,519][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:28:23,177][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:28:23,836][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:28:24,494][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:28:25,153][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:28:25,812][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:28:26,469][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:28:27,129][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:28:27,787][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:28:28,446][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:28:29,104][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:28:29,762][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:28:30,421][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:28:31,079][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:28:31,737][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:28:32,396][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:28:33,053][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:28:33,711][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:28:34,369][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:28:35,029][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:28:35,687][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:28:36,345][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:28:37,003][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:28:37,660][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:28:38,644][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:28:39,304][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:28:39,964][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:28:40,621][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:28:41,279][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:28:41,940][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:28:42,600][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:28:43,259][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:28:43,916][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:28:44,574][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:28:45,232][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:28:45,890][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:28:46,548][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:28:47,206][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:28:47,865][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:28:48,524][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:28:49,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:28:49,964][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:28:51,302][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:28:51,305][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:28:51,306][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:28:52,751][__main__][INFO] - Iteration 613 took 58s (19.42% Gen, 78.12% Train). Generation: 11s, Training: 45s. Estimated remaining time: 6h 56m 7s. Estimated total time: 16h 18m 55s. Time estimates for 10 more iterations: 9m 47s, 100 more iterations: 1h 37m 53s, 500 more iterations: 8h 9m 27s. [2026-03-25 23:28:52,753][__main__][INFO] - Starting iteration 613. [2026-03-25 23:28:52,757][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:28:52,758][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:28:59,227][__main__][INFO] - Number of regex retries in iteration 613: 0 [2026-03-25 23:28:59,228][__main__][INFO] - agents played in iteration 613 are Bob, Alice [2026-03-25 23:28:59,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:28:59,807][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:28:59,808][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:28:59,808][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:29:00,468][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:29:01,080][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:29:01,739][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:29:02,396][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:29:03,055][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:29:03,713][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:29:04,370][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:29:05,027][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:29:05,684][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:29:06,341][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:29:06,998][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:29:07,656][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:29:08,313][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:29:08,970][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:29:09,629][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:29:10,288][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:29:10,947][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:29:11,607][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:29:12,264][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:29:12,921][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:29:13,579][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:29:14,236][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:29:14,893][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:29:15,550][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:29:16,208][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:29:16,865][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:29:17,523][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:29:18,181][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:29:18,838][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:29:19,496][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:29:20,155][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:29:20,813][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:29:21,473][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:29:22,131][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:29:22,789][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:29:23,447][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:29:24,106][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:29:24,764][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:29:25,422][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:29:26,080][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:29:26,739][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:29:27,397][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:29:28,055][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:29:28,713][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:29:29,372][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:29:30,029][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:29:30,687][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:29:31,345][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:29:32,339][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:29:32,998][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:29:33,658][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:29:34,315][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:29:34,975][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:29:35,633][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:29:36,291][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:29:36,948][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:29:37,607][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:29:38,264][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:29:38,921][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:29:39,580][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:29:40,238][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:29:40,896][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:29:41,555][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:29:42,213][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:29:42,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:29:43,649][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:29:44,992][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:29:44,995][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:29:44,996][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:29:46,367][__main__][INFO] - Iteration 614 took 53s (12.07% Gen, 85.37% Train). Generation: 6s, Training: 45s. Estimated remaining time: 5h 29m 50s. Estimated total time: 14h 53m 31s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 21s, 500 more iterations: 7h 26m 45s. [2026-03-25 23:29:46,369][__main__][INFO] - Starting iteration 614. [2026-03-25 23:29:46,374][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:29:46,375][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:29:51,268][__main__][INFO] - Number of regex retries in iteration 614: 0 [2026-03-25 23:29:51,269][__main__][INFO] - agents played in iteration 614 are Bob, Alice [2026-03-25 23:29:51,876][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:29:51,938][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:29:51,939][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:29:51,939][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:29:52,760][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:29:53,371][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:29:54,032][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:29:54,689][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:29:55,346][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:29:56,003][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:29:56,661][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:29:57,319][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:29:57,977][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:29:58,635][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:29:59,296][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:29:59,953][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:30:00,611][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:30:01,269][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:30:01,927][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:30:02,584][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:30:03,243][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:30:03,902][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:30:04,561][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:30:05,217][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:30:05,875][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:30:06,532][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:30:07,190][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:30:07,849][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:30:08,507][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:30:09,166][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:30:09,824][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:30:10,483][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:30:11,140][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:30:11,797][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:30:12,454][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:30:13,111][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:30:13,768][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:30:14,426][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:30:15,083][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:30:15,741][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:30:16,399][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:30:17,057][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:30:17,714][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:30:18,372][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:30:19,030][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:30:19,689][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:30:20,347][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:30:21,004][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:30:21,662][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:30:22,320][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:30:22,977][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:30:23,635][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:30:24,631][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:30:25,290][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:30:25,948][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:30:26,609][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:30:27,267][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:30:27,925][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:30:28,586][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:30:29,244][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:30:29,902][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:30:30,562][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:30:31,220][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:30:31,882][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:30:32,541][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:30:33,201][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:30:33,859][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:30:34,520][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:30:35,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:30:35,985][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:30:37,339][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:30:37,342][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:30:37,343][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:30:38,715][__main__][INFO] - Iteration 615 took 52s (9.35% Gen, 88.02% Train). Generation: 4s, Training: 46s. Estimated remaining time: 5h 7m 50s. Estimated total time: 14h 32m 23s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 14s, 500 more iterations: 7h 16m 11s. [2026-03-25 23:30:38,717][__main__][INFO] - Starting iteration 615. [2026-03-25 23:30:38,722][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:30:38,722][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:30:43,840][__main__][INFO] - Number of regex retries in iteration 615: 0 [2026-03-25 23:30:43,840][__main__][INFO] - agents played in iteration 615 are Bob, Alice [2026-03-25 23:30:44,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:30:44,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:30:44,798][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:30:44,798][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:30:45,542][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:30:46,158][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:30:46,817][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:30:47,476][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:30:48,136][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:30:48,795][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:30:49,455][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:30:50,114][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:30:50,775][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:30:51,434][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:30:52,093][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:30:52,753][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:30:53,412][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:30:54,070][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:30:54,729][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:30:55,387][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:30:56,046][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:30:56,705][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:30:57,363][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:30:58,021][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:30:58,681][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:30:59,341][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:30:59,999][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:31:00,657][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:31:01,316][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:31:01,974][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:31:02,634][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:31:03,292][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:31:03,951][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:31:04,609][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:31:05,268][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:31:05,926][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:31:06,585][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:31:08,753][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:31:10,252][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:31:10,909][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:31:11,566][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:31:12,223][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:31:12,881][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:31:13,538][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:31:14,196][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:31:14,854][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:31:15,511][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:31:16,168][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:31:16,825][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:31:17,483][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:31:18,140][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:31:18,797][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:31:19,790][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:31:20,448][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:31:21,107][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:31:21,765][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:31:22,423][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:31:23,081][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:31:23,739][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:31:24,396][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:31:25,056][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:31:25,713][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:31:26,371][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:31:27,029][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:31:27,686][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:31:28,345][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:31:29,002][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:31:29,659][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:31:30,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:31:31,104][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:45 [2026-03-25 23:31:32,436][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:31:32,439][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:31:32,441][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:31:34,105][__main__][INFO] - Iteration 616 took 55s (9.24% Gen, 87.75% Train). Generation: 5s, Training: 48s. Estimated remaining time: 5h 57m 36s. Estimated total time: 15h 23m 5s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 18s, 500 more iterations: 7h 41m 32s. [2026-03-25 23:31:34,107][__main__][INFO] - Starting iteration 616. [2026-03-25 23:31:34,112][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:31:34,112][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:31:36,708][mllm.models.large_language_model_local][WARNING] - Response %A> did not match regex: (|), retry 1/1 [2026-03-25 23:31:39,809][__main__][INFO] - Number of regex retries in iteration 616: 1 [2026-03-25 23:31:39,810][__main__][INFO] - agents played in iteration 616 are Bob, Alice [2026-03-25 23:31:40,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:31:40,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:31:40,475][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:31:40,475][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2026-03-25 23:31:41,248][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:31:41,863][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:31:42,528][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:31:43,186][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:31:43,847][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:31:45,468][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:31:47,507][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:31:48,166][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:31:48,826][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:31:49,484][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:31:50,142][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 23:31:50,803][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:31:51,461][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:31:52,120][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:31:52,777][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:31:53,435][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:31:54,093][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:31:54,751][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:31:55,409][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:31:58,506][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:31:59,165][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 256 [2026-03-25 23:31:59,823][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:32:00,480][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:32:01,138][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:32:01,795][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:32:02,453][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:32:03,111][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:32:03,768][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:32:04,425][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:32:05,083][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:32:06,553][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:32:08,949][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 23:32:09,609][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:32:10,268][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:32:10,926][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:32:11,585][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:32:12,244][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:32:12,903][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:32:13,562][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:32:14,221][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:32:14,878][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 
23:32:15,537][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:32:16,194][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:32:16,852][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:32:17,510][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:32:18,167][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:32:18,825][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:32:19,484][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:32:20,481][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:32:21,139][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:32:21,798][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:32:22,456][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 23:32:23,114][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:32:23,771][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:32:24,437][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:32:25,096][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:32:25,753][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:32:26,411][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:32:27,074][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:32:27,734][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:32:28,394][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:32:29,053][mllm.training.trainer_common][INFO] - 
Processing mini-batch 244 of 256 [2026-03-25 23:32:29,712][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:32:30,370][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:32:31,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 23:32:31,844][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:50 [2026-03-25 23:32:33,208][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:32:33,210][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:32:33,212][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:32:34,692][__main__][INFO] - Iteration 617 took 1m 0s (9.40% Gen, 88.15% Train). Generation: 5s, Training: 53s. Estimated remaining time: 7h 23m 12s. Estimated total time: 16h 49m 42s. Time estimates for 10 more iterations: 10m 5s, 100 more iterations: 1h 40m 58s, 500 more iterations: 8h 24m 51s. [2026-03-25 23:32:34,694][__main__][INFO] - Starting iteration 617. [2026-03-25 23:32:34,699][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:32:34,700][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:32:39,605][__main__][INFO] - Number of regex retries in iteration 617: 0 [2026-03-25 23:32:39,606][__main__][INFO] - agents played in iteration 617 are Bob, Alice [2026-03-25 23:32:40,113][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:32:40,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:32:40,174][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:32:40,175][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:32:40,888][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:32:41,503][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:32:42,162][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:32:42,828][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:32:43,487][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:32:44,146][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:32:44,804][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:32:45,464][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:32:46,122][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:32:46,781][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:32:47,439][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:32:48,097][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:32:48,755][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:32:49,412][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:32:50,070][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:32:50,728][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:32:51,386][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:32:52,043][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:32:52,701][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:32:53,360][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:32:54,018][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:32:54,676][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:32:55,335][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:32:55,992][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:32:56,650][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:32:57,309][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:32:57,967][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:32:58,627][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:32:59,285][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:32:59,943][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:33:00,602][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:33:01,261][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:33:01,919][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:33:02,576][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:33:03,234][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:33:03,892][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:33:04,550][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:33:05,208][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:33:05,865][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:33:06,523][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:33:07,182][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:33:07,840][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:33:08,499][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:33:09,157][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:33:09,814][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:33:10,472][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:33:11,131][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:33:11,789][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:33:12,776][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:33:13,436][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:33:14,095][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:33:14,753][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:33:15,410][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:33:16,068][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:33:16,727][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:33:17,385][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:33:18,042][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:33:18,702][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:33:19,359][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:33:20,017][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:33:20,675][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:33:21,333][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:33:21,991][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:33:22,650][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:33:23,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:33:24,123][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:33:25,505][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:33:25,508][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:33:25,509][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:33:26,975][__main__][INFO] - Iteration 618 took 52s (9.38% Gen, 87.81% Train). Generation: 4s, Training: 45s. Estimated remaining time: 5h 3m 57s. Estimated total time: 14h 31m 19s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 7s, 500 more iterations: 7h 15m 39s. [2026-03-25 23:33:26,978][__main__][INFO] - Starting iteration 618. [2026-03-25 23:33:26,982][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:33:26,982][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:33:36,784][__main__][INFO] - Number of regex retries in iteration 618: 0 [2026-03-25 23:33:36,785][__main__][INFO] - agents played in iteration 618 are Bob, Alice [2026-03-25 23:33:37,567][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:33:37,630][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:33:37,631][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:33:37,631][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:33:38,305][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:33:38,955][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:33:39,573][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:33:40,230][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:33:40,888][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:33:41,545][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:33:42,203][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:33:42,863][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:33:43,521][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:33:44,178][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:33:44,836][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:33:45,495][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:33:46,153][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:33:46,811][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:33:47,471][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:33:48,128][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:33:48,786][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:33:49,444][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:33:50,101][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:33:50,759][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:33:51,417][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:33:52,075][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:33:52,732][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:33:53,390][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:33:54,049][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:33:54,707][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:33:55,364][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:33:56,022][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:33:56,680][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:33:57,337][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:33:57,995][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:33:58,653][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:33:59,311][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:33:59,969][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:34:00,627][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:34:01,286][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:34:01,944][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:34:02,606][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:34:03,266][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:34:03,924][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:34:04,582][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:34:05,240][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:34:05,898][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:34:06,556][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:34:07,214][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:34:07,872][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:34:08,530][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:34:09,188][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:34:10,173][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:34:10,834][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:34:11,492][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:34:12,152][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:34:12,811][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:34:13,470][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:34:14,129][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:34:14,787][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:34:15,445][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:34:16,104][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:34:16,762][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:34:17,420][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:34:18,078][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:34:18,736][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:34:19,394][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:34:20,053][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:34:20,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:34:21,492][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:34:22,857][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:34:22,860][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:34:22,861][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:34:24,239][__main__][INFO] - Iteration 619 took 57s (17.12% Gen, 80.47% Train). Generation: 9s, Training: 46s. Estimated remaining time: 6h 26m 0s. Estimated total time: 15h 54m 19s. Time estimates for 10 more iterations: 9m 32s, 100 more iterations: 1h 35m 25s, 500 more iterations: 7h 57m 9s. [2026-03-25 23:34:24,242][__main__][INFO] - Starting iteration 619. [2026-03-25 23:34:24,248][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:34:24,249][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:34:29,891][__main__][INFO] - Number of regex retries in iteration 619: 0 [2026-03-25 23:34:29,892][__main__][INFO] - agents played in iteration 619 are Bob, Alice [2026-03-25 23:34:30,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:34:30,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:34:30,863][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:34:30,864][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:34:31,577][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:34:32,185][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:34:32,845][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:34:33,505][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:34:34,166][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:34:34,826][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:34:35,485][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:34:36,145][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:34:36,803][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:34:37,461][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:34:38,121][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:34:38,779][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:34:39,438][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:34:40,097][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:34:40,757][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:34:41,416][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:34:42,075][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:34:42,733][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:34:43,391][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:34:44,050][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:34:44,709][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:34:45,369][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:34:46,028][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:34:46,687][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:34:47,346][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:34:48,004][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:34:48,662][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:34:49,321][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:34:49,980][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:34:50,639][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:34:51,298][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:34:51,957][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:34:52,616][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:34:53,274][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:34:53,933][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:34:54,591][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:34:55,250][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:34:55,909][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:34:56,568][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:34:57,226][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:34:57,886][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:34:58,545][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:34:59,205][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:34:59,863][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:35:00,522][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:35:01,181][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:35:01,841][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:35:02,506][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:35:03,531][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:35:04,189][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:35:04,846][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:35:05,505][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:35:06,163][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:35:06,821][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:35:07,478][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:35:08,136][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:35:08,793][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:35:09,451][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:35:10,112][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:35:10,770][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:35:11,428][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:35:12,086][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:35:12,743][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:35:13,404][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:35:14,062][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:35:14,826][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:35:16,197][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:35:16,200][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:35:16,201][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:35:17,551][__main__][INFO] - Iteration 620 took 53s (10.59% Gen, 86.87% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 19m 14s. Estimated total time: 14h 48m 26s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 50s, 500 more iterations: 7h 24m 13s. [2026-03-25 23:35:17,553][__main__][INFO] - Starting iteration 620. [2026-03-25 23:35:17,558][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:35:17,558][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:35:24,157][__main__][INFO] - Number of regex retries in iteration 620: 0 [2026-03-25 23:35:24,159][__main__][INFO] - agents played in iteration 620 are Bob, Alice [2026-03-25 23:35:24,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:35:24,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:35:24,837][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:35:24,837][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:35:25,517][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:35:26,124][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:35:26,784][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:35:27,442][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:35:28,103][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:35:28,764][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:35:29,424][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:35:30,083][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:35:30,742][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:35:31,402][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:35:32,061][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:35:32,719][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:35:33,377][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:35:34,036][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:35:34,695][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:35:35,353][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:35:36,012][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:35:36,672][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:35:37,330][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:35:37,989][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:35:38,648][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:35:39,307][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:35:39,966][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:35:40,625][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:35:41,284][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:35:41,943][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:35:42,601][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:35:43,260][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:35:43,919][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:35:44,578][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:35:45,237][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:35:45,896][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:35:46,555][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:35:47,214][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:35:47,872][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:35:48,532][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:35:51,057][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:35:51,715][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:35:52,373][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:35:53,031][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:35:53,690][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:35:54,349][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:35:55,007][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:35:55,666][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:35:56,325][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:35:56,985][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:35:57,645][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:35:58,304][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:35:59,309][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:35:59,967][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:36:00,626][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:36:01,285][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:36:01,942][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:36:02,601][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:36:03,259][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:36:03,917][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:36:04,575][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:36:05,233][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:36:05,891][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:36:06,549][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:36:07,207][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:36:07,866][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:36:08,524][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:36:09,182][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:36:09,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:36:10,617][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:45 [2026-03-25 23:36:11,961][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:36:11,963][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:36:11,965][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:36:13,537][__main__][INFO] - Iteration 621 took 55s (11.79% Gen, 85.40% Train). Generation: 6s, Training: 47s. Estimated remaining time: 6h 2m 53s. Estimated total time: 15h 33m 1s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 18s, 500 more iterations: 7h 46m 30s. [2026-03-25 23:36:13,539][__main__][INFO] - Starting iteration 621. [2026-03-25 23:36:13,544][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:36:13,544][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:36:18,282][__main__][INFO] - Number of regex retries in iteration 621: 0 [2026-03-25 23:36:18,283][__main__][INFO] - agents played in iteration 621 are Bob, Alice [2026-03-25 23:36:18,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:36:18,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:36:18,849][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:36:18,849][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:36:19,509][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:36:20,127][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:36:20,786][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:36:21,445][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:36:22,105][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:36:22,763][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:36:23,426][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:36:24,086][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:36:24,745][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:36:25,407][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:36:26,064][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:36:26,723][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:36:27,383][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:36:28,041][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:36:28,700][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:36:29,359][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:36:30,018][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:36:30,678][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:36:31,337][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:36:31,995][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:36:32,654][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:36:33,313][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:36:33,973][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:36:34,631][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:36:35,290][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:36:35,948][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:36:36,608][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:36:37,267][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:36:37,926][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:36:38,584][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:36:39,243][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:36:39,902][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:36:40,561][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:36:41,220][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:36:41,879][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:36:42,538][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:36:43,197][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:36:43,855][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:36:44,515][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:36:45,174][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:36:45,832][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:36:46,491][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:36:47,149][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:36:47,807][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:36:48,466][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:36:49,124][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:36:49,782][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:36:50,441][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:36:51,434][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:36:52,093][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:36:52,751][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:36:53,409][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:36:54,068][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:36:54,725][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:36:55,383][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:36:56,041][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:36:56,698][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:36:57,356][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:36:58,014][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:36:58,673][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:36:59,335][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:36:59,993][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:37:00,651][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:37:01,309][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:37:01,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:37:02,744][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:37:04,086][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:37:04,089][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:37:04,090][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:37:05,511][__main__][INFO] - Iteration 622 took 51s (9.12% Gen, 88.14% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 55m 9s. Estimated total time: 14h 26m 9s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 36s, 500 more iterations: 7h 13m 4s. [2026-03-25 23:37:05,513][__main__][INFO] - Starting iteration 622. [2026-03-25 23:37:05,519][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:37:05,519][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:37:10,319][__main__][INFO] - Number of regex retries in iteration 622: 0 [2026-03-25 23:37:10,320][__main__][INFO] - agents played in iteration 622 are Bob, Alice [2026-03-25 23:37:10,811][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:37:10,872][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:37:10,873][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:37:10,873][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:37:11,535][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:37:12,144][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:37:12,804][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:37:13,463][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:37:14,121][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:37:14,782][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:37:15,442][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:37:16,101][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:37:16,764][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:37:17,424][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:37:18,083][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:37:18,742][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:37:19,402][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:37:20,061][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:37:20,720][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:37:21,379][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:37:22,039][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:37:22,697][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:37:23,356][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:37:24,014][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:37:24,673][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:37:25,331][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:37:25,990][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:37:26,647][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:37:27,306][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:37:27,965][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:37:28,624][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:37:29,282][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:37:29,940][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:37:30,599][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:37:31,258][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:37:31,917][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:37:32,576][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:37:33,233][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:37:33,892][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:37:34,550][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:37:35,208][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:37:35,867][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:37:36,527][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:37:37,185][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:37:37,843][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:37:38,502][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:37:39,162][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:37:39,820][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:37:40,478][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:37:41,138][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:37:41,797][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:37:42,456][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:37:43,448][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:37:44,106][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:37:44,764][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:37:45,422][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:37:46,080][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:37:46,738][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:37:47,396][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:37:48,053][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:37:48,712][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:37:49,372][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:37:50,030][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:37:50,688][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:37:51,345][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:37:52,004][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:37:52,662][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:37:53,320][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:37:53,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:37:54,752][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:37:56,093][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:37:56,095][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:37:56,096][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:37:57,497][__main__][INFO] - Iteration 623 took 51s (9.24% Gen, 88.06% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 54m 28s. Estimated total time: 14h 26m 20s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 38s, 500 more iterations: 7h 13m 10s. [2026-03-25 23:37:57,499][__main__][INFO] - Starting iteration 623. [2026-03-25 23:37:57,504][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:37:57,505][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:38:06,320][__main__][INFO] - Number of regex retries in iteration 623: 0 [2026-03-25 23:38:06,322][__main__][INFO] - agents played in iteration 623 are Bob, Alice [2026-03-25 23:38:07,422][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:38:07,484][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:38:07,485][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:38:07,485][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:38:08,182][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:38:08,806][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:38:09,467][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:38:10,125][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:38:10,784][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:38:11,441][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:38:12,102][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:38:12,759][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:38:13,417][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:38:14,074][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:38:14,732][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:38:15,390][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:38:16,048][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:38:16,705][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:38:17,363][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:38:18,021][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:38:18,679][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:38:19,336][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:38:19,993][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:38:20,651][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:38:21,308][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:38:21,966][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:38:22,623][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:38:23,280][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:38:23,938][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:38:24,596][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:38:25,253][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:38:25,910][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:38:26,568][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:38:27,226][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:38:27,883][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:38:28,542][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:38:29,199][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:38:29,856][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:38:30,514][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:38:31,171][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:38:31,828][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:38:32,486][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:38:33,143][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:38:33,801][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:38:34,459][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:38:35,117][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:38:35,774][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:38:36,432][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:38:37,090][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:38:37,748][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:38:38,406][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:38:39,063][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:38:40,049][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:38:40,708][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:38:41,367][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:38:42,025][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:38:42,687][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:38:43,343][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:38:44,001][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:38:44,658][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:38:45,316][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:38:45,973][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:38:46,630][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:38:47,293][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:38:47,950][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:38:48,608][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:38:49,266][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:38:49,924][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:38:50,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:38:51,341][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:38:52,700][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:38:52,703][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:38:52,704][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:38:54,165][__main__][INFO] - Iteration 624 took 56s (15.56% Gen, 81.85% Train). Generation: 8s, Training: 46s. Estimated remaining time: 6h 11m 34s. Estimated total time: 15h 44m 23s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 26s, 500 more iterations: 7h 52m 11s. [2026-03-25 23:38:54,168][__main__][INFO] - Starting iteration 624. [2026-03-25 23:38:54,173][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:38:54,174][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:39:02,683][__main__][INFO] - Number of regex retries in iteration 624: 0 [2026-03-25 23:39:02,684][__main__][INFO] - agents played in iteration 624 are Bob, Alice [2026-03-25 23:39:03,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:39:03,238][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:39:03,239][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:39:03,239][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:39:03,900][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:39:04,516][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:39:05,174][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:39:05,832][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:39:06,491][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:39:07,149][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:39:07,813][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:39:08,468][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:39:09,124][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:39:09,784][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:39:10,442][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:39:11,100][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:39:11,758][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:39:12,415][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:39:13,073][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:39:13,731][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:39:14,388][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:39:15,046][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:39:15,703][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:39:16,361][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:39:17,019][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:39:17,676][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:39:18,333][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:39:18,991][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:39:19,651][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:39:20,308][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:39:20,966][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:39:21,624][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:39:22,282][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:39:22,940][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:39:23,597][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:39:24,254][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:39:24,913][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:39:25,571][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:39:26,228][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:39:26,886][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:39:27,543][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:39:28,201][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:39:28,859][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:39:29,517][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:39:30,175][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:39:30,833][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:39:31,491][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:39:32,154][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:39:32,811][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:39:33,469][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:39:34,128][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:39:34,785][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:39:35,788][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:39:36,447][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:39:37,106][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:39:37,763][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:39:38,421][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:39:39,079][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:39:39,738][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:39:40,395][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:39:41,054][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:39:41,711][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:39:42,370][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:39:43,027][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:39:43,685][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:39:44,343][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:39:45,002][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:39:45,659][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:39:46,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:39:47,091][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:39:48,410][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:39:48,412][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:39:48,414][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:39:49,860][__main__][INFO] - Iteration 625 took 55s (15.28% Gen, 82.11% Train). Generation: 8s, Training: 45s. Estimated remaining time: 5h 54m 25s. Estimated total time: 15h 28m 10s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 49s, 500 more iterations: 7h 44m 5s. [2026-03-25 23:39:49,863][__main__][INFO] - Starting iteration 625. [2026-03-25 23:39:49,868][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:39:49,869][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:39:54,780][__main__][INFO] - Number of regex retries in iteration 625: 0 [2026-03-25 23:39:54,781][__main__][INFO] - agents played in iteration 625 are Bob, Alice [2026-03-25 23:39:55,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:39:55,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:39:55,326][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:39:55,326][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:39:55,995][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:39:56,605][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:39:57,265][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:39:57,923][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:39:58,582][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:39:59,240][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:39:59,898][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:40:00,555][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:40:01,212][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:40:01,869][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:40:02,527][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:40:03,184][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:40:03,841][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:40:04,498][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:40:05,155][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:40:05,812][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:40:06,471][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:40:07,129][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:40:07,788][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:40:08,445][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:40:09,102][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:40:09,760][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:40:10,417][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:40:11,076][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:40:11,733][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:40:12,390][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:40:13,049][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:40:13,706][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:40:14,363][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:40:15,021][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:40:15,678][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:40:16,335][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:40:16,996][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:40:17,656][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:40:18,317][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:40:18,975][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:40:19,634][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:40:20,293][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:40:20,951][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:40:21,609][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:40:22,269][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:40:22,928][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:40:23,586][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:40:24,245][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:40:24,903][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:40:25,560][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:40:26,220][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:40:26,879][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:40:27,865][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:40:28,526][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:40:29,189][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:40:29,846][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:40:30,503][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:40:31,161][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:40:31,819][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:40:32,478][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:40:33,136][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:40:33,796][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:40:34,453][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:40:35,111][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:40:35,768][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:40:36,425][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:40:37,084][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:40:37,741][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:40:38,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:40:39,193][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:40:40,527][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:40:40,529][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:40:40,531][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:40:41,949][__main__][INFO] - Iteration 626 took 52s (9.43% Gen, 87.84% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 53m 28s. Estimated total time: 14h 28m 4s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 48s, 500 more iterations: 7h 14m 2s. [2026-03-25 23:40:41,952][__main__][INFO] - Starting iteration 626. [2026-03-25 23:40:41,956][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:40:41,956][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:40:53,612][__main__][INFO] - Number of regex retries in iteration 626: 0 [2026-03-25 23:40:53,613][__main__][INFO] - agents played in iteration 626 are Bob, Alice [2026-03-25 23:40:54,695][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:40:54,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:40:54,758][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:40:54,758][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:40:55,603][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:40:56,222][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:40:56,883][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:40:57,541][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:40:58,200][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:40:58,859][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:40:59,519][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:41:00,178][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:41:00,847][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:41:01,505][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:41:02,164][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:41:02,822][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:41:03,481][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:41:04,140][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:41:04,799][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:41:05,458][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:41:06,116][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:41:06,775][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:41:07,433][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:41:08,091][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:41:08,749][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:41:09,407][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:41:10,066][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:41:10,726][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:41:11,385][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:41:12,043][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:41:12,703][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:41:13,360][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:41:14,018][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:41:14,677][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:41:15,335][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:41:15,994][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:41:16,652][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:41:17,310][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:41:17,969][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:41:18,628][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:41:19,287][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:41:19,947][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:41:20,606][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:41:21,264][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:41:21,922][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:41:22,582][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:41:23,240][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:41:23,899][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:41:24,557][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:41:25,216][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:41:25,874][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:41:26,534][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:41:27,537][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:41:28,196][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:41:28,854][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:41:29,513][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:41:30,171][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:41:30,831][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:41:31,489][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:41:32,146][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:41:32,805][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:41:33,463][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:41:34,121][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:41:34,780][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:41:35,438][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:41:36,095][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:41:36,753][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:41:37,411][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:41:38,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:41:38,887][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:41:40,230][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:41:40,232][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:41:40,234][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:41:41,940][__main__][INFO] - Iteration 627 took 59s (19.43% Gen, 77.72% Train). Generation: 11s, Training: 46s. Estimated remaining time: 7h 4m 9s. Estimated total time: 16h 39m 46s. Time estimates for 10 more iterations: 9m 59s, 100 more iterations: 1h 39m 58s, 500 more iterations: 8h 19m 53s. [2026-03-25 23:41:41,943][__main__][INFO] - Starting iteration 627. [2026-03-25 23:41:41,949][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:41:41,949][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:41:47,583][__main__][INFO] - Number of regex retries in iteration 627: 0 [2026-03-25 23:41:47,584][__main__][INFO] - agents played in iteration 627 are Bob, Alice [2026-03-25 23:41:48,513][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:41:48,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:41:48,575][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:41:48,576][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:41:49,408][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:41:50,083][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:41:50,708][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:41:51,365][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:41:52,023][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:41:52,681][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:41:53,338][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:41:53,997][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:41:54,655][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:41:55,313][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:41:55,970][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:41:56,629][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:41:57,287][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:41:57,945][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:41:58,603][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:41:59,260][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:41:59,917][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:42:00,575][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:42:01,233][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:42:01,892][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:42:02,549][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:42:03,207][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:42:03,864][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:42:04,522][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:42:05,179][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:42:05,837][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:42:06,495][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:42:07,153][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:42:07,811][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:42:08,469][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:42:09,127][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:42:09,784][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:42:10,442][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:42:11,099][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:42:11,757][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:42:12,414][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:42:13,072][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:42:13,729][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:42:14,387][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:42:15,048][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:42:15,709][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:42:16,370][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:42:17,026][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:42:17,686][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:42:18,344][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:42:19,003][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:42:19,661][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:42:20,318][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:42:21,318][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:42:21,977][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:42:22,635][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:42:23,293][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:42:23,951][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:42:24,608][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:42:25,267][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:42:25,925][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:42:26,583][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:42:27,241][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:42:27,900][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:42:28,558][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:42:29,216][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:42:29,875][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:42:30,533][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:42:31,191][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:42:31,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:42:32,580][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:42:33,967][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:42:33,970][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:42:33,971][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:42:35,323][__main__][INFO] - Iteration 628 took 53s (10.56% Gen, 86.90% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 13m 7s. Estimated total time: 14h 49m 37s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 57s, 500 more iterations: 7h 24m 48s. [2026-03-25 23:42:35,325][__main__][INFO] - Starting iteration 628. [2026-03-25 23:42:35,330][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:42:35,330][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:42:48,495][__main__][INFO] - Number of regex retries in iteration 628: 0 [2026-03-25 23:42:48,497][__main__][INFO] - agents played in iteration 628 are Bob, Alice [2026-03-25 23:42:49,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:42:49,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:42:49,146][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:42:49,146][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:42:49,980][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:42:50,592][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:42:51,253][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:42:51,910][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:42:52,571][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:42:53,230][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:42:53,893][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:42:54,551][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:42:55,210][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:42:55,868][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:42:56,527][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:42:57,186][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:42:57,845][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:42:58,505][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:42:59,164][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:42:59,823][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:43:00,481][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:43:01,139][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:43:01,799][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:43:02,459][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:43:03,117][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:43:03,775][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:43:04,434][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:43:05,093][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:43:05,752][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:43:06,411][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:43:07,070][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:43:07,729][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:43:08,388][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:43:09,047][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:43:09,707][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:43:10,366][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:43:11,025][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:43:11,683][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:43:12,342][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:43:13,001][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:43:13,660][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:43:14,318][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:43:14,976][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:43:15,635][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:43:16,293][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:43:16,954][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:43:17,610][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:43:18,270][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:43:18,929][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:43:19,588][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:43:20,249][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:43:20,908][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:43:21,894][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:43:22,553][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:43:23,212][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:43:23,869][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:43:24,527][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:43:25,185][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:43:25,842][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:43:26,501][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:43:27,159][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:43:27,818][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:43:28,476][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:43:29,134][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:43:29,793][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:43:30,451][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:43:31,109][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:43:31,767][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:43:32,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:43:33,274][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:43:34,698][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:43:34,701][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:43:34,702][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:43:36,081][__main__][INFO] - Iteration 629 took 1m 0s (21.67% Gen, 76.05% Train). Generation: 13s, Training: 46s. Estimated remaining time: 7h 15m 3s. Estimated total time: 16h 52m 34s. Time estimates for 10 more iterations: 10m 7s, 100 more iterations: 1h 41m 15s, 500 more iterations: 8h 26m 17s. [2026-03-25 23:43:36,084][__main__][INFO] - Starting iteration 629. [2026-03-25 23:43:36,088][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:43:36,089][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:43:40,983][__main__][INFO] - Number of regex retries in iteration 629: 0 [2026-03-25 23:43:40,984][__main__][INFO] - agents played in iteration 629 are Bob, Alice [2026-03-25 23:43:41,497][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:43:41,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:43:41,560][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:43:41,560][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:43:42,295][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:43:42,906][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:43:43,566][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:43:44,225][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:43:44,884][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:43:45,542][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:43:46,201][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:43:46,859][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:43:47,519][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:43:48,177][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:43:48,838][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:43:49,494][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:43:50,152][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:43:50,809][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:43:51,468][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:43:52,127][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:43:52,785][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:43:53,444][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:43:54,102][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:43:54,760][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:43:55,420][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:43:56,079][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:43:56,738][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:43:57,395][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:43:58,055][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:43:58,714][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:43:59,371][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:44:00,030][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:44:00,688][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:44:01,348][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:44:02,007][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:44:02,667][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:44:03,326][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:44:03,984][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:44:04,641][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:44:05,299][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:44:05,957][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:44:06,616][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:44:07,274][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:44:07,931][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:44:08,589][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:44:09,248][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:44:09,906][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:44:10,564][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:44:11,222][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:44:11,879][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:44:12,537][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:44:13,195][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:44:14,189][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:44:14,848][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:44:15,505][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:44:16,165][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:44:16,823][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:44:17,481][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:44:18,145][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:44:18,803][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:44:19,463][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:44:20,121][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:44:20,779][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:44:21,437][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:44:22,095][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:44:22,754][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:44:23,412][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:44:24,072][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:44:24,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:44:25,521][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:44:26,909][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:44:26,912][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:44:26,913][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:44:28,250][__main__][INFO] - Iteration 630 took 52s (9.38% Gen, 88.05% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 51m 0s. Estimated total time: 14h 29m 23s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 56s, 500 more iterations: 7h 14m 41s. [2026-03-25 23:44:28,252][__main__][INFO] - Starting iteration 630. [2026-03-25 23:44:28,256][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:44:28,257][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:44:34,237][__main__][INFO] - Number of regex retries in iteration 630: 0 [2026-03-25 23:44:34,238][__main__][INFO] - agents played in iteration 630 are Bob, Alice [2026-03-25 23:44:35,133][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:44:35,196][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:44:35,196][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:44:35,197][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:44:35,956][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:44:36,573][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:44:37,233][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:44:37,891][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:44:38,550][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:44:39,208][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:44:39,867][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:44:40,527][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:44:41,187][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:44:41,846][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:44:42,511][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:44:43,170][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:44:43,830][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:44:44,488][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:44:45,151][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:44:45,812][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:44:46,474][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:44:47,132][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:44:47,791][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:44:48,450][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:44:49,108][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:44:49,767][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:44:50,426][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:44:51,085][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:44:51,744][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:44:52,403][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:44:53,062][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:44:53,720][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:44:54,380][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:44:55,039][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:44:55,698][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:44:56,358][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:44:57,017][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:44:57,676][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:44:58,336][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:44:58,996][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:44:59,655][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:45:00,314][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:45:00,974][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:45:01,632][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:45:02,292][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:45:02,952][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:45:03,611][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:45:04,269][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:45:04,928][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:45:05,587][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:45:06,246][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:45:06,905][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:45:07,899][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:45:08,557][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:45:09,215][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:45:09,873][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:45:10,532][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:45:11,190][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:45:11,848][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:45:12,506][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:45:13,164][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:45:13,821][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:45:14,479][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:45:15,137][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:45:15,795][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:45:16,455][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:45:17,114][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:45:17,772][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:45:18,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:45:19,252][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:45:20,622][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:45:20,625][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:45:20,626][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:45:21,970][__main__][INFO] - Iteration 631 took 53s (11.13% Gen, 86.36% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 15m 59s. Estimated total time: 14h 55m 16s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 31s, 500 more iterations: 7h 27m 38s. [2026-03-25 23:45:21,972][__main__][INFO] - Starting iteration 631. [2026-03-25 23:45:21,978][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:45:21,979][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:45:28,744][__main__][INFO] - Number of regex retries in iteration 631: 0 [2026-03-25 23:45:28,745][__main__][INFO] - agents played in iteration 631 are Bob, Alice [2026-03-25 23:45:29,229][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:45:29,291][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:45:29,291][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:45:29,292][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:45:30,120][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:45:30,734][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:45:31,399][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:45:32,058][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:45:32,719][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:45:33,378][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:45:34,037][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:45:34,699][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:45:35,357][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:45:36,019][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:45:36,679][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:45:37,337][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:45:37,996][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:45:38,656][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:45:39,314][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:45:39,974][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:45:40,633][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:45:41,293][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:45:41,951][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:45:42,612][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:45:43,271][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:45:43,931][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:45:44,591][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:45:45,250][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:45:45,909][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:45:46,568][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:45:47,230][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:45:47,889][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:45:48,550][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:45:49,209][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:45:49,867][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:45:50,527][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:45:51,186][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:45:51,844][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:45:52,504][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:45:53,162][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:45:53,821][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:45:54,479][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:45:55,138][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:45:55,797][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:45:56,455][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:45:57,115][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:45:57,774][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:45:58,435][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:45:59,094][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:45:59,752][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:46:00,411][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:46:01,070][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:46:02,066][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:46:02,726][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:46:03,385][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:46:04,042][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:46:04,700][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:46:05,360][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:46:06,017][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:46:06,674][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:46:07,334][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:46:07,992][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:46:08,650][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:46:09,309][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:46:09,969][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:46:10,628][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:46:11,288][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:46:11,945][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:46:12,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:46:13,391][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:46:14,726][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:46:14,728][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:46:14,729][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:46:16,203][__main__][INFO] - Iteration 632 took 54s (12.48% Gen, 84.80% Train). Generation: 6s, Training: 45s. Estimated remaining time: 5h 23m 37s. Estimated total time: 15h 3m 48s. Time estimates for 10 more iterations: 9m 2s, 100 more iterations: 1h 30m 22s, 500 more iterations: 7h 31m 54s. [2026-03-25 23:46:16,205][__main__][INFO] - Starting iteration 632. [2026-03-25 23:46:16,209][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:46:16,210][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:46:22,864][__main__][INFO] - Number of regex retries in iteration 632: 0 [2026-03-25 23:46:22,866][__main__][INFO] - agents played in iteration 632 are Bob, Alice [2026-03-25 23:46:23,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:46:23,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:46:23,515][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:46:23,516][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:46:24,344][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:46:24,950][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:46:25,611][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:46:26,268][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:46:26,926][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:46:27,584][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:46:28,242][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:46:28,900][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:46:29,557][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:46:30,216][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:46:30,874][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:46:31,532][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:46:32,192][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:46:32,849][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:46:33,506][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:46:34,164][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:46:34,823][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:46:35,480][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:46:36,139][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:46:36,796][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:46:37,456][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:46:38,113][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:46:38,771][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:46:39,429][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:46:40,086][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:46:40,744][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:46:41,401][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:46:42,058][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:46:42,716][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:46:43,374][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:46:44,033][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:46:44,690][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:46:45,347][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:46:46,008][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:46:46,666][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:46:47,323][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:46:47,982][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:46:48,639][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:46:49,296][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:46:49,955][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:46:50,613][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:46:51,270][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:46:51,928][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:46:52,587][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:46:53,244][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:46:53,902][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:46:54,561][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:46:55,220][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:46:56,215][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:46:56,874][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:46:57,532][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:46:58,191][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:46:58,849][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:46:59,507][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:47:00,165][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:47:00,823][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:47:01,480][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:47:02,138][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:47:02,797][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:47:03,454][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:47:04,112][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:47:04,769][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:47:05,427][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:47:06,086][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:47:06,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:47:07,520][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:47:08,838][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:47:08,840][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:47:08,842][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:47:10,173][__main__][INFO] - Iteration 633 took 53s (12.33% Gen, 85.19% Train). Generation: 6s, Training: 45s. Estimated remaining time: 5h 18m 21s. Estimated total time: 14h 59m 25s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 56s, 500 more iterations: 7h 29m 42s. [2026-03-25 23:47:10,175][__main__][INFO] - Starting iteration 633. [2026-03-25 23:47:10,179][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:47:10,180][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:47:15,307][__main__][INFO] - Number of regex retries in iteration 633: 0 [2026-03-25 23:47:15,308][__main__][INFO] - agents played in iteration 633 are Bob, Alice [2026-03-25 23:47:15,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:47:15,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:47:15,864][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:47:15,865][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:47:16,686][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:47:17,290][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:47:17,951][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:47:18,611][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:47:19,269][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:47:19,928][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:47:20,586][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:47:21,245][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:47:21,903][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:47:22,561][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:47:23,219][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:47:23,877][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:47:24,535][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:47:25,193][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:47:25,851][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:47:26,509][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:47:27,167][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:47:27,824][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:47:28,484][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:47:29,142][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:47:29,800][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:47:30,458][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:47:31,116][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:47:31,774][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:47:32,432][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:47:33,091][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:47:33,749][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:47:34,408][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:47:35,066][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:47:35,724][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:47:36,381][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:47:37,041][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:47:37,699][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:47:38,357][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:47:39,018][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:47:39,680][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:47:40,336][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:47:40,996][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:47:41,655][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:47:42,313][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:47:42,971][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:47:43,628][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:47:44,286][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:47:44,944][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:47:45,601][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:47:46,258][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:47:46,916][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:47:47,573][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:47:48,562][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:47:49,221][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:47:49,880][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:47:50,538][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:47:51,195][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:47:51,854][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:47:52,512][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:47:53,171][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:47:53,828][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:47:54,486][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:47:55,144][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:47:55,801][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:47:56,459][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:47:57,116][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:47:57,774][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:47:58,432][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:47:59,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:47:59,874][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:48:01,201][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:48:01,205][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:48:01,206][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:48:02,595][__main__][INFO] - Iteration 634 took 52s (9.78% Gen, 87.56% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 51m 40s. Estimated total time: 14h 33m 37s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 21s, 500 more iterations: 7h 16m 48s. [2026-03-25 23:48:02,597][__main__][INFO] - Starting iteration 634. [2026-03-25 23:48:02,602][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:48:02,603][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:48:09,702][__main__][INFO] - Number of regex retries in iteration 634: 0 [2026-03-25 23:48:09,703][__main__][INFO] - agents played in iteration 634 are Bob, Alice [2026-03-25 23:48:10,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:48:10,254][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:48:10,255][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:48:10,255][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:48:11,021][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:48:11,625][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:48:12,286][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:48:12,947][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:48:13,611][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:48:14,268][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:48:14,926][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:48:15,587][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:48:16,250][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:48:16,910][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:48:17,568][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:48:18,228][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:48:18,887][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:48:19,546][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:48:20,204][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:48:20,862][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:48:21,521][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:48:22,182][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:48:22,842][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:48:23,502][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:48:24,163][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:48:24,821][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:48:25,480][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:48:26,139][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:48:26,797][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:48:27,455][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:48:28,114][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:48:28,772][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:48:29,432][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:48:30,091][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:48:30,749][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:48:31,408][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:48:32,067][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:48:32,726][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:48:33,385][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:48:34,043][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:48:34,701][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:48:35,360][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:48:36,018][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:48:36,676][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:48:37,335][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:48:37,993][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:48:38,652][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:48:39,310][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:48:39,969][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:48:40,628][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:48:41,287][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:48:41,945][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:48:42,937][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:48:43,597][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:48:44,255][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:48:44,913][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:48:45,570][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:48:46,227][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:48:46,885][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:48:47,542][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:48:48,199][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:48:48,857][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:48:49,515][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:48:50,172][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:48:50,831][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:48:51,488][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:48:52,146][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:48:52,804][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:48:53,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:48:54,237][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:48:55,576][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:48:55,578][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:48:55,580][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:48:56,979][__main__][INFO] - Iteration 635 took 54s (13.06% Gen, 84.36% Train). Generation: 7s, Training: 45s. Estimated remaining time: 5h 23m 28s. Estimated total time: 15h 6m 20s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 38s, 500 more iterations: 7h 33m 10s. [2026-03-25 23:48:56,981][__main__][INFO] - Starting iteration 635. [2026-03-25 23:48:56,986][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:48:56,986][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:49:03,262][__main__][INFO] - Number of regex retries in iteration 635: 0 [2026-03-25 23:49:03,264][__main__][INFO] - agents played in iteration 635 are Bob, Alice [2026-03-25 23:49:04,194][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:49:04,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:49:04,256][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:49:04,257][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:49:05,098][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:49:05,705][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:49:06,367][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:49:07,024][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:49:07,682][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:49:08,343][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:49:09,001][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:49:09,659][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:49:10,316][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:49:10,974][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:49:11,632][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:49:12,290][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:49:12,948][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:49:13,605][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:49:14,263][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:49:14,923][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:49:15,580][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:49:16,237][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:49:16,895][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:49:17,552][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:49:18,211][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:49:18,868][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:49:19,525][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:49:20,182][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:49:20,839][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:49:21,497][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:49:22,155][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:49:22,813][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:49:23,471][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:49:24,130][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:49:24,790][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:49:25,448][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:49:26,106][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:49:26,763][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:49:27,421][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:49:28,078][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:49:28,740][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:49:29,397][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:49:30,055][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:49:30,712][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:49:31,370][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:49:32,027][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:49:32,686][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:49:33,343][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:49:34,001][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:49:34,659][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:49:35,316][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:49:35,975][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:49:36,974][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:49:37,633][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:49:38,291][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:49:38,951][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:49:39,608][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:49:40,268][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:49:40,924][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:49:41,584][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:49:42,241][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:49:42,901][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:49:43,558][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:49:44,217][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:49:44,874][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:49:45,532][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:49:46,191][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:49:46,850][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:49:47,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:49:48,206][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:49:49,608][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:49:49,611][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:49:49,612][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:49:50,900][__main__][INFO] - Iteration 636 took 53s (11.64% Gen, 85.96% Train). Generation: 6s, Training: 46s. Estimated remaining time: 5h 14m 51s. Estimated total time: 14h 58m 36s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 51s, 500 more iterations: 7h 29m 18s. [2026-03-25 23:49:50,903][__main__][INFO] - Starting iteration 636. [2026-03-25 23:49:50,907][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:49:50,908][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:49:56,036][__main__][INFO] - Number of regex retries in iteration 636: 0 [2026-03-25 23:49:56,038][__main__][INFO] - agents played in iteration 636 are Bob, Alice [2026-03-25 23:49:56,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:49:56,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:49:56,699][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:49:56,700][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:49:57,357][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:49:57,970][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:49:58,631][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:49:59,290][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:49:59,949][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:50:00,608][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:50:01,267][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:50:01,927][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:50:02,587][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:50:03,246][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:50:03,905][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:50:04,564][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:50:05,223][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:50:05,881][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:50:06,540][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:50:07,200][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:50:07,858][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:50:08,517][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:50:09,175][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:50:09,834][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:50:10,493][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:50:11,152][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:50:11,811][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:50:12,470][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:50:13,128][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:50:13,786][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:50:14,445][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:50:15,104][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:50:15,763][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:50:16,422][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:50:17,080][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:50:17,738][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:50:18,397][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:50:19,055][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:50:19,714][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:50:20,373][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:50:21,032][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:50:21,690][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:50:22,349][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:50:23,007][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:50:23,665][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:50:24,325][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:50:24,985][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:50:25,643][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:50:26,302][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:50:26,961][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:50:27,619][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:50:28,278][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:50:29,276][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:50:29,935][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:50:30,593][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:50:31,252][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:50:31,909][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:50:32,567][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:50:33,226][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:50:33,883][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:50:34,541][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:50:35,200][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:50:35,858][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:50:36,515][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:50:37,174][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:50:37,832][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:50:38,490][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:50:39,149][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:50:39,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:50:40,589][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:50:41,900][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:50:41,903][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:50:41,904][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:50:43,530][__main__][INFO] - Iteration 637 took 52s (9.75% Gen, 87.16% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 52m 26s. Estimated total time: 14h 37m 5s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 42s, 500 more iterations: 7h 18m 32s. [2026-03-25 23:50:43,533][__main__][INFO] - Starting iteration 637. [2026-03-25 23:50:43,538][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:50:43,538][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:50:48,246][__main__][INFO] - Number of regex retries in iteration 637: 0 [2026-03-25 23:50:48,247][__main__][INFO] - agents played in iteration 637 are Bob, Alice [2026-03-25 23:50:48,743][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:50:48,804][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:50:48,805][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:50:48,805][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:50:49,459][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:50:50,075][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:50:50,735][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:50:51,394][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:50:52,052][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:50:52,711][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:50:53,370][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:50:54,029][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:50:54,687][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:50:55,345][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:50:56,002][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:50:56,661][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:50:57,319][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:50:57,977][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:50:58,636][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:50:59,294][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:50:59,952][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:51:00,610][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:51:01,268][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:51:01,926][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:51:02,584][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:51:03,248][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:51:03,908][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:51:04,564][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:51:05,222][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:51:05,880][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:51:06,538][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:51:07,197][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:51:07,855][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:51:08,514][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:51:09,173][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:51:09,834][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:51:10,493][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:51:11,151][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:51:11,810][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:51:12,469][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:51:13,127][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:51:13,787][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:51:14,448][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:51:15,107][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:51:15,767][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:51:16,425][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:51:17,086][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:51:17,745][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:51:18,404][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:51:19,063][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:51:19,723][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:51:20,382][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:51:21,372][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:51:22,030][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:51:22,688][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:51:23,347][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:51:24,005][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:51:24,662][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:51:25,322][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:51:25,979][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:51:26,637][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:51:27,295][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:51:27,953][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:51:28,611][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:51:29,269][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:51:29,928][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:51:30,587][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:51:31,247][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:51:31,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:51:32,694][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:51:34,065][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:51:34,068][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:51:34,079][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:51:35,781][__main__][INFO] - Iteration 638 took 52s (9.01% Gen, 87.73% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 45m 14s. Estimated total time: 14h 30m 45s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 4s, 500 more iterations: 7h 15m 22s. [2026-03-25 23:51:35,783][__main__][INFO] - Starting iteration 638. [2026-03-25 23:51:35,788][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:51:35,789][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:51:41,323][__main__][INFO] - Number of regex retries in iteration 638: 0 [2026-03-25 23:51:41,323][__main__][INFO] - agents played in iteration 638 are Bob, Alice [2026-03-25 23:51:41,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:51:41,887][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:51:41,888][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:51:41,888][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:51:42,583][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:51:43,198][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:51:43,856][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:51:44,518][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:51:45,177][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:51:45,835][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:51:46,492][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:51:47,152][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:51:47,811][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:51:48,469][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:51:49,127][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:51:49,785][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:51:50,443][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:51:51,101][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:51:51,759][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:51:52,418][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:51:53,076][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:51:53,734][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:51:54,391][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:51:55,048][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:51:55,707][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:51:56,365][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:51:57,022][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:51:57,680][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:51:58,340][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:51:58,998][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:51:59,657][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:52:00,315][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:52:00,973][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:52:01,630][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:52:02,287][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:52:02,944][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:52:03,606][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:52:04,265][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:52:04,923][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:52:05,582][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:52:06,241][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:52:06,900][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:52:07,559][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:52:08,216][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:52:08,874][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:52:09,532][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:52:10,190][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:52:10,847][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:52:11,505][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:52:12,163][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:52:12,820][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:52:13,478][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:52:14,489][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:52:15,148][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:52:15,817][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:52:16,476][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:52:17,134][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:52:17,792][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:52:18,450][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:52:19,108][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:52:19,767][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:52:20,424][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:52:21,082][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:52:21,739][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:52:22,397][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:52:23,055][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:52:23,712][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:52:24,369][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:52:25,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:52:25,859][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:52:27,201][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:52:27,204][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:52:27,205][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:52:28,615][__main__][INFO] - Iteration 639 took 52s (10.48% Gen, 86.85% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 54m 6s. Estimated total time: 14h 40m 29s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 2s, 500 more iterations: 7h 20m 14s. [2026-03-25 23:52:28,618][__main__][INFO] - Starting iteration 639. [2026-03-25 23:52:28,624][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:52:28,625][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:52:35,734][__main__][INFO] - Number of regex retries in iteration 639: 0 [2026-03-25 23:52:35,735][__main__][INFO] - agents played in iteration 639 are Bob, Alice [2026-03-25 23:52:36,853][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:52:36,914][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:52:36,914][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:52:36,915][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:52:37,712][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:52:38,327][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:52:38,987][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:52:39,647][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:52:40,309][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:52:40,969][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:52:41,628][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:52:42,288][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:52:42,949][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:52:43,608][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:52:44,266][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:52:44,926][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:52:45,584][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:52:46,244][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:52:46,902][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:52:47,561][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:52:48,219][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:52:48,878][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:52:49,538][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:52:50,196][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:52:50,854][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:52:51,512][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:52:52,171][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:52:52,834][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:52:53,493][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:52:54,151][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:52:54,812][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:52:55,471][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:52:56,130][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:52:56,790][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:52:57,448][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:52:58,107][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:52:58,767][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:52:59,425][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:53:00,084][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:53:00,742][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:53:01,402][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:53:02,061][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:53:02,719][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:53:03,378][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:53:04,036][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:53:04,695][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:53:05,354][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:53:06,012][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:53:06,671][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:53:07,330][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:53:07,989][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:53:08,647][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:53:09,632][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:53:10,291][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:53:10,950][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:53:11,608][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:53:12,265][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:53:12,923][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:53:13,580][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:53:14,238][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:53:14,898][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:53:15,556][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:53:16,214][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:53:16,871][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:53:17,530][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:53:18,188][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:53:18,847][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:53:19,504][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:53:20,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:53:20,935][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:53:22,430][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:53:22,433][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:53:22,434][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:53:23,847][__main__][INFO] - Iteration 640 took 55s (12.87% Gen, 84.56% Train). Generation: 7s, Training: 46s. Estimated remaining time: 5h 33m 6s. Estimated total time: 15h 20m 25s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 2s, 500 more iterations: 7h 40m 12s. [2026-03-25 23:53:23,849][__main__][INFO] - Starting iteration 640. [2026-03-25 23:53:23,855][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:53:23,856][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:53:29,599][__main__][INFO] - Number of regex retries in iteration 640: 0 [2026-03-25 23:53:29,599][__main__][INFO] - agents played in iteration 640 are Bob, Alice [2026-03-25 23:53:30,203][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:53:30,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:53:30,265][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:53:30,265][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:53:31,035][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:53:31,652][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:53:32,313][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:53:32,971][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:53:33,630][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:53:34,288][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:53:34,949][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:53:35,606][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:53:36,266][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:53:36,927][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:53:37,589][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:53:38,251][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:53:38,907][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:53:39,567][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:53:40,226][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:53:40,884][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:53:41,543][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:53:42,201][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:53:42,861][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:53:43,521][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:53:44,180][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:53:44,838][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:53:45,497][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:53:46,155][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:53:46,815][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:53:47,474][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:53:48,134][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:53:48,793][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:53:49,451][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:53:50,110][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:53:50,769][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:53:51,428][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:53:52,086][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:53:52,745][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:53:53,404][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:53:54,063][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:53:54,722][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:53:55,381][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:53:56,041][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:53:56,700][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:53:57,358][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:53:58,017][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:53:58,676][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:53:59,335][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:53:59,994][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:54:00,654][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:54:01,312][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:54:01,971][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:54:03,120][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:54:03,780][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:54:04,439][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:54:05,097][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:54:05,754][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:54:06,415][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:54:07,077][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:54:07,734][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:54:08,395][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:54:09,052][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:54:09,711][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:54:10,371][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:54:11,031][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:54:11,690][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:54:12,349][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:54:13,007][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:54:13,665][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:54:14,550][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:54:15,928][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:54:15,932][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:54:15,933][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:54:17,364][__main__][INFO] - Iteration 641 took 53s (10.73% Gen, 86.59% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 3m 40s. Estimated total time: 14h 51m 52s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 11s, 500 more iterations: 7h 25m 56s. [2026-03-25 23:54:17,367][__main__][INFO] - Starting iteration 641. [2026-03-25 23:54:17,377][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:54:17,378][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:54:25,087][__main__][INFO] - Number of regex retries in iteration 641: 0 [2026-03-25 23:54:25,089][__main__][INFO] - agents played in iteration 641 are Bob, Alice [2026-03-25 23:54:25,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:54:25,670][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:54:25,671][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:54:25,671][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:54:26,377][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:54:27,035][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:54:27,695][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:54:28,355][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:54:29,015][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:54:29,674][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:54:30,332][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:54:30,992][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:54:31,652][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:54:32,310][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:54:32,969][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:54:33,628][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:54:34,286][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:54:34,946][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:54:35,606][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:54:36,265][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:54:36,923][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:54:37,582][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:54:38,241][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:54:38,899][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:54:39,557][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:54:40,216][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:54:40,877][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:54:41,536][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:54:42,194][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:54:42,853][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:54:43,511][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:54:44,170][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:54:44,828][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:54:45,488][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:54:46,148][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:54:46,806][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:54:47,465][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:54:48,125][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:54:48,784][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:54:49,443][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:54:50,102][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:54:50,760][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:54:51,419][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:54:52,078][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:54:52,737][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:54:53,396][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:54:54,055][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:54:54,713][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:54:55,372][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:54:56,032][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:54:56,690][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:54:57,349][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:54:58,333][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:54:58,993][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:54:59,651][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:55:00,309][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:55:00,966][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:55:01,624][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:55:02,283][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:55:02,940][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:55:03,599][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:55:04,259][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:55:04,916][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:55:05,574][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:55:06,232][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:55:06,890][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:55:07,549][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:55:08,206][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:55:08,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:55:09,605][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:55:10,993][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:55:10,996][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:55:10,997][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:55:13,423][__main__][INFO] - Iteration 642 took 56s (13.76% Gen, 81.90% Train). Generation: 7s, Training: 45s. Estimated remaining time: 5h 45m 5s. Estimated total time: 15h 34m 14s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 25s, 500 more iterations: 7h 47m 7s. [2026-03-25 23:55:13,430][__main__][INFO] - Starting iteration 642. [2026-03-25 23:55:13,436][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:55:13,437][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:55:18,511][__main__][INFO] - Number of regex retries in iteration 642: 0 [2026-03-25 23:55:18,512][__main__][INFO] - agents played in iteration 642 are Bob, Alice [2026-03-25 23:55:19,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:55:19,165][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:55:19,165][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:55:19,166][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:55:20,015][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:55:20,632][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:55:21,292][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:55:21,951][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:55:22,612][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:55:23,270][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:55:23,929][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:55:24,587][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:55:25,245][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:55:25,905][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:55:26,562][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:55:27,221][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:55:27,879][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:55:28,537][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:55:29,195][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:55:29,853][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:55:30,510][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:55:31,168][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:55:31,827][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:55:32,486][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:55:33,144][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:55:33,801][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:55:34,459][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:55:35,116][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:55:35,775][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:55:36,433][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:55:37,090][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:55:37,748][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:55:38,406][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:55:39,064][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:55:39,721][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:55:40,379][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:55:41,036][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:55:41,694][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:55:42,353][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:55:43,010][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:55:43,667][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:55:44,325][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:55:44,983][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:55:45,640][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:55:46,298][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:55:46,956][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:55:47,613][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:55:48,271][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:55:48,931][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:55:49,588][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:55:50,246][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:55:50,905][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:55:51,896][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:55:52,556][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:55:53,214][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:55:53,871][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:55:54,529][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:55:55,187][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:55:55,845][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:55:56,503][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:55:57,163][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:55:57,821][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:55:58,479][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:55:59,140][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:55:59,799][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:56:00,456][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:56:01,113][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:56:01,773][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:56:02,432][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:56:03,166][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:56:04,576][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:56:04,578][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:56:04,579][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:56:06,225][__main__][INFO] - Iteration 643 took 52s (9.61% Gen, 87.26% Train). Generation: 5s, Training: 46s. Estimated remaining time: 4h 49m 51s. Estimated total time: 14h 39m 52s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 59s, 500 more iterations: 7h 19m 56s. [2026-03-25 23:56:06,228][__main__][INFO] - Starting iteration 643. [2026-03-25 23:56:06,232][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:56:06,233][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:56:11,854][__main__][INFO] - Number of regex retries in iteration 643: 0 [2026-03-25 23:56:11,855][__main__][INFO] - agents played in iteration 643 are Bob, Alice [2026-03-25 23:56:12,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:56:12,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:56:12,834][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:56:12,835][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:56:13,638][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:56:14,305][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:56:14,965][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:56:15,625][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:56:16,285][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:56:16,945][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:56:17,605][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:56:18,264][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:56:18,922][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:56:19,583][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:56:20,243][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:56:20,903][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:56:21,562][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:56:22,221][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:56:22,880][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:56:23,539][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:56:24,198][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:56:24,856][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:56:25,515][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:56:26,174][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:56:26,833][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:56:27,491][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:56:28,150][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:56:28,808][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:56:29,467][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:56:30,126][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:56:30,785][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:56:31,445][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:56:32,104][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:56:32,768][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:56:33,427][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:56:34,083][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:56:34,742][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:56:35,401][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:56:36,060][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:56:36,718][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:56:37,378][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:56:38,037][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:56:38,695][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:56:39,354][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:56:40,013][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:56:40,671][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:56:41,331][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:56:41,990][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:56:42,648][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:56:43,307][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:56:43,966][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:56:44,626][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:56:45,630][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:56:46,289][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:56:46,950][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:56:47,608][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:56:48,266][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:56:48,924][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:56:49,583][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:56:50,241][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:56:50,899][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:56:51,557][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:56:52,222][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:56:52,878][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:56:53,538][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:56:54,196][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:56:54,854][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:56:55,512][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:56:56,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:56:56,951][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:56:58,297][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:56:58,300][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:56:58,301][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:56:59,828][__main__][INFO] - Iteration 644 took 53s (10.49% Gen, 86.70% Train). Generation: 5s, Training: 46s. Estimated remaining time: 5h 2m 24s. Estimated total time: 14h 53m 18s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 19s, 500 more iterations: 7h 26m 39s. [2026-03-25 23:56:59,830][__main__][INFO] - Starting iteration 644. [2026-03-25 23:56:59,836][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:56:59,837][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:57:05,522][__main__][INFO] - Number of regex retries in iteration 644: 0 [2026-03-25 23:57:05,524][__main__][INFO] - agents played in iteration 644 are Bob, Alice [2026-03-25 23:57:06,549][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:57:06,616][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:57:06,617][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:57:06,617][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:57:07,300][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:57:07,927][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:57:08,587][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:57:09,246][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:57:09,907][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:57:10,565][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:57:11,226][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:57:11,884][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:57:12,549][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:57:13,207][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:57:13,866][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:57:14,526][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:57:15,185][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:57:15,845][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:57:16,503][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:57:17,162][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:57:17,821][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:57:18,480][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:57:19,139][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:57:19,798][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:57:20,457][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:57:21,115][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:57:21,774][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:57:22,434][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:57:23,094][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:57:23,752][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:57:24,410][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:57:25,069][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:57:25,729][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:57:26,387][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:57:27,045][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:57:27,704][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:57:28,363][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:57:29,022][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:57:29,682][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:57:30,341][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:57:31,000][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:57:31,659][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:57:32,317][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:57:32,975][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:57:33,634][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:57:34,291][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:57:34,950][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:57:35,609][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:57:36,268][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:57:36,926][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:57:37,585][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:57:38,245][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:57:39,247][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:57:39,906][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:57:40,565][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:57:41,229][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:57:41,886][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:57:42,546][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:57:43,204][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:57:43,863][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:57:44,520][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:57:45,179][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:57:45,837][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:57:46,497][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:57:47,153][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:57:47,812][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:57:48,470][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:57:49,128][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:57:49,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:57:50,573][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:57:51,895][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:57:51,897][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:57:51,898][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:57:53,283][__main__][INFO] - Iteration 645 took 53s (10.64% Gen, 86.76% Train). Generation: 5s, Training: 46s. Estimated remaining time: 4h 59m 2s. Estimated total time: 14h 50m 50s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 5s, 500 more iterations: 7h 25m 25s. [2026-03-25 23:57:53,285][__main__][INFO] - Starting iteration 645. [2026-03-25 23:57:53,290][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:57:53,291][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:57:58,577][__main__][INFO] - Number of regex retries in iteration 645: 0 [2026-03-25 23:57:58,578][__main__][INFO] - agents played in iteration 645 are Bob, Alice [2026-03-25 23:57:59,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:57:59,194][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:57:59,195][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:57:59,195][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:57:59,857][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:58:00,475][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:58:01,135][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:58:01,795][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:58:02,454][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:58:03,114][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:58:03,774][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:58:04,432][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:58:05,091][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:58:05,749][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:58:06,409][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:58:07,067][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:58:07,726][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:58:08,385][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:58:09,044][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:58:09,703][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:58:10,362][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:58:11,020][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:58:11,679][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:58:12,338][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:58:12,997][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:58:13,656][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:58:14,314][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:58:14,972][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:58:15,631][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:58:16,290][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:58:16,950][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:58:17,608][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:58:18,266][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:58:18,925][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:58:19,584][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:58:20,242][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:58:20,901][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:58:21,559][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:58:22,218][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:58:22,882][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:58:23,539][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:58:24,198][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:58:24,860][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:58:25,520][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:58:26,179][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:58:26,838][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:58:27,497][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:58:28,156][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:58:28,815][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:58:29,474][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:58:30,134][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:58:30,793][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:58:31,807][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:58:32,466][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:58:33,123][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:58:33,781][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:58:34,439][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:58:35,098][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:58:35,756][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:58:36,414][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:58:37,072][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:58:37,730][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:58:38,387][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:58:39,046][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:58:39,707][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:58:40,368][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:58:41,027][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:58:41,684][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:58:42,343][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:58:43,142][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:58:45,132][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:58:45,135][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:58:45,136][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:58:46,567][__main__][INFO] - Iteration 646 took 53s (9.92% Gen, 87.38% Train). Generation: 5s, Training: 46s. Estimated remaining time: 4h 55m 18s. Estimated total time: 14h 47m 59s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 47s, 500 more iterations: 7h 23m 59s. [2026-03-25 23:58:46,569][__main__][INFO] - Starting iteration 646. [2026-03-25 23:58:46,574][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:58:46,575][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:58:52,937][__main__][INFO] - Number of regex retries in iteration 646: 0 [2026-03-25 23:58:52,938][__main__][INFO] - agents played in iteration 646 are Bob, Alice [2026-03-25 23:58:53,422][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:58:53,482][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:58:53,483][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:58:53,483][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:58:54,312][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:58:54,931][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:58:55,592][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:58:56,251][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:58:56,910][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:58:57,569][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:58:58,228][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:58:58,891][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:58:59,555][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:59:00,212][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:59:00,872][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:59:01,532][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:59:02,190][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:59:02,853][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:59:03,515][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:59:04,173][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:59:04,836][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:59:05,497][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:59:06,157][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:59:06,818][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:59:07,477][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:59:08,136][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:59:08,795][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:59:09,453][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:59:10,113][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:59:10,773][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:59:11,431][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:59:12,090][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:59:12,750][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:59:13,409][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:59:14,068][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:59:14,727][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:59:15,386][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:59:16,045][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:59:16,703][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:59:17,363][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:59:18,022][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:59:18,681][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:59:19,339][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:59:19,999][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:59:20,658][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:59:21,317][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:59:21,975][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:59:22,633][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:59:23,292][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:59:23,954][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:59:24,614][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:59:25,272][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:59:26,270][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:59:26,929][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:59:27,588][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:59:28,247][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:59:28,905][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:59:29,565][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:59:30,222][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:59:30,881][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:59:31,540][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:59:32,199][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:59:32,857][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:59:33,514][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:59:34,172][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:59:34,829][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:59:35,488][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:59:36,146][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:59:36,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:59:37,593][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-25 23:59:38,931][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:59:38,933][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:59:38,935][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:59:40,366][__main__][INFO] - Iteration 647 took 53s (11.83% Gen, 85.50% Train). Generation: 6s, Training: 45s. Estimated remaining time: 5h 3m 0s. Estimated total time: 14h 56m 35s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 39s, 500 more iterations: 7h 28m 17s. [2026-03-25 23:59:40,369][__main__][INFO] - Starting iteration 647. [2026-03-25 23:59:40,373][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:59:40,374][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:59:46,238][__main__][INFO] - Number of regex retries in iteration 647: 0 [2026-03-25 23:59:46,239][__main__][INFO] - agents played in iteration 647 are Bob, Alice [2026-03-25 23:59:46,740][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:59:46,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-25 23:59:46,801][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:59:46,802][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:59:47,665][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:59:48,281][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:59:48,943][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:59:49,604][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:59:50,264][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:59:50,924][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:59:51,583][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:59:52,242][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:59:52,901][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:59:53,559][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:59:54,218][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:59:54,876][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:59:55,535][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:59:56,195][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:59:56,854][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:59:57,512][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:59:58,173][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:59:58,832][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:59:59,490][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:00:00,148][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:00:00,807][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:00:01,466][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:00:02,125][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:00:02,783][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:00:03,441][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:00:04,099][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:00:04,758][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:00:05,417][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:00:06,075][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:00:06,734][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:00:07,393][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:00:08,051][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:00:08,710][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:00:09,369][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:00:10,029][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:00:10,687][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:00:11,345][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:00:12,003][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:00:12,667][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:00:13,325][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:00:13,984][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:00:14,642][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:00:15,301][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:00:15,961][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:00:16,619][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:00:17,278][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:00:17,937][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:00:18,598][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:00:19,601][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:00:20,259][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:00:20,918][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:00:21,577][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:00:22,236][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:00:22,893][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:00:23,553][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:00:24,210][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:00:24,868][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:00:25,526][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:00:26,183][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:00:26,841][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:00:27,499][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:00:28,157][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:00:28,814][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:00:29,473][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:00:30,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:00:30,944][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:00:32,261][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:00:32,266][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:00:32,268][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:00:33,901][__main__][INFO] - Iteration 648 took 53s (10.96% Gen, 85.98% Train). Generation: 5s, Training: 46s. Estimated remaining time: 4h 57m 42s. Estimated total time: 14h 52m 10s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 13s, 500 more iterations: 7h 26m 5s. [2026-03-26 00:00:33,903][__main__][INFO] - Starting iteration 648. [2026-03-26 00:00:33,909][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:00:33,909][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:00:49,683][__main__][INFO] - Number of regex retries in iteration 648: 0 [2026-03-26 00:00:49,684][__main__][INFO] - agents played in iteration 648 are Bob, Alice [2026-03-26 00:00:50,781][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:00:50,849][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:00:50,849][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:00:50,850][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:00:51,579][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:00:52,182][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:00:52,841][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:00:53,498][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:00:54,161][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:00:54,816][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:00:55,474][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:00:56,131][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:00:56,789][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:00:57,447][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:00:58,104][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:00:58,764][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:00:59,421][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:01:00,078][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:01:00,736][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:01:01,393][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:01:02,050][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:01:02,709][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:01:03,366][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:01:04,027][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:01:04,685][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:01:05,342][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:01:05,999][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:01:06,656][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:01:07,314][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:01:07,971][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:01:08,629][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:01:09,286][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:01:09,944][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:01:10,601][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:01:11,259][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:01:11,917][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:01:12,580][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:01:13,238][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:01:13,896][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:01:14,554][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:01:15,212][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:01:15,869][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:01:16,527][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:01:17,185][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:01:17,843][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:01:18,501][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:01:19,158][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:01:19,816][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:01:20,473][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:01:21,131][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:01:21,789][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:01:22,447][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:01:23,439][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:01:24,097][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:01:24,756][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:01:25,415][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:01:26,073][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:01:26,732][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:01:27,389][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:01:28,047][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:01:28,705][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:01:29,363][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:01:30,020][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:01:30,678][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:01:31,339][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:01:31,998][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:01:32,657][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:01:33,315][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:01:33,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:01:34,753][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:01:36,103][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:01:36,106][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:01:36,107][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:01:37,433][__main__][INFO] - Iteration 649 took 1m 3s (24.83% Gen, 73.07% Train). Generation: 15s, Training: 46s. Estimated remaining time: 7h 43m 15s. Estimated total time: 17h 38m 47s. Time estimates for 10 more iterations: 10m 35s, 100 more iterations: 1h 45m 52s, 500 more iterations: 8h 49m 23s. [2026-03-26 00:01:37,436][__main__][INFO] - Starting iteration 649. [2026-03-26 00:01:37,440][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:01:37,441][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:01:42,295][__main__][INFO] - Number of regex retries in iteration 649: 0 [2026-03-26 00:01:42,295][__main__][INFO] - agents played in iteration 649 are Bob, Alice [2026-03-26 00:01:42,860][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:01:42,921][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:01:42,922][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:01:42,922][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:01:43,737][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:01:44,354][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:01:45,013][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:01:45,671][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:01:46,329][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:01:46,988][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:01:47,645][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:01:48,307][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:01:48,964][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:01:49,626][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:01:50,282][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:01:50,940][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:01:51,599][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:01:52,256][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:01:52,916][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:01:53,573][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:01:54,231][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:01:54,890][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:01:55,549][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:01:56,206][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:01:56,866][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:01:57,524][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:01:58,182][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:01:58,841][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:01:59,504][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:02:00,161][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:02:00,820][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:02:01,481][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:02:02,138][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:02:02,797][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:02:03,454][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:02:04,111][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:02:04,770][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:02:05,427][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:02:06,085][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:02:06,743][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:02:07,402][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:02:08,061][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:02:08,718][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:02:09,378][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:02:10,036][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:02:10,694][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:02:11,352][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:02:12,010][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:02:12,667][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:02:13,325][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:02:13,984][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:02:14,643][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:02:15,642][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:02:16,301][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:02:16,961][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:02:17,617][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:02:18,275][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:02:18,933][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:02:19,591][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:02:20,249][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:02:20,907][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:02:21,569][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:02:22,226][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:02:22,884][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:02:23,542][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:02:24,199][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:02:24,858][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:02:25,516][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:02:26,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:02:26,984][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:02:28,322][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:02:28,325][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:02:28,326][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:02:30,065][__main__][INFO] - Iteration 650 took 52s (9.22% Gen, 87.46% Train). Generation: 4s, Training: 46s. Estimated remaining time: 4h 40m 42s. Estimated total time: 14h 37m 7s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 42s, 500 more iterations: 7h 18m 33s. [2026-03-26 00:02:30,067][__main__][INFO] - Starting iteration 650. [2026-03-26 00:02:30,072][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:02:30,073][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:02:35,438][__main__][INFO] - Number of regex retries in iteration 650: 0 [2026-03-26 00:02:35,440][__main__][INFO] - agents played in iteration 650 are Bob, Alice [2026-03-26 00:02:35,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:02:36,040][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:02:36,040][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:02:36,041][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:02:36,843][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:02:37,462][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:02:38,121][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:02:38,780][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:02:39,440][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:02:40,098][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:02:40,756][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:02:41,414][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:02:42,072][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:02:42,730][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:02:43,388][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:02:44,047][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:02:44,704][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:02:45,362][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:02:46,020][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:02:46,678][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:02:47,335][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:02:47,993][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:02:48,652][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:02:49,310][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:02:49,968][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:02:50,626][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:02:51,284][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:02:51,942][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:02:52,601][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:02:53,264][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:02:53,922][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:02:54,581][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:02:55,240][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:02:55,898][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:02:56,557][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:02:57,214][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:02:57,873][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:02:58,533][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:02:59,191][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:02:59,849][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:03:00,510][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:03:01,169][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:03:01,829][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:03:02,487][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:03:03,147][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:03:03,805][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:03:04,462][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:03:05,121][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:03:05,779][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:03:06,438][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:03:07,096][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:03:07,754][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:03:08,746][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:03:09,407][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:03:10,065][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:03:10,723][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:03:11,381][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:03:12,039][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:03:12,701][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:03:13,359][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:03:14,017][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:03:14,676][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:03:15,334][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:03:15,992][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:03:16,652][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:03:17,311][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:03:17,970][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:03:18,629][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:03:19,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:03:20,084][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:03:21,429][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:03:21,432][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:03:21,433][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:03:24,819][__main__][INFO] - Iteration 651 took 54s (9.80% Gen, 84.01% Train). Generation: 5s, Training: 45s. Estimated remaining time: 5h 15m 10s. Estimated total time: 15h 12m 29s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 14s, 500 more iterations: 7h 36m 14s. [2026-03-26 00:03:24,822][__main__][INFO] - Starting iteration 651. [2026-03-26 00:03:24,827][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:03:24,828][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:03:29,743][__main__][INFO] - Number of regex retries in iteration 651: 0 [2026-03-26 00:03:29,744][__main__][INFO] - agents played in iteration 651 are Bob, Alice [2026-03-26 00:03:30,236][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:03:30,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:03:30,298][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:03:30,298][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:03:31,019][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:03:31,637][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:03:32,297][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:03:32,956][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:03:33,614][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:03:34,272][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:03:34,933][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:03:35,591][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:03:36,250][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:03:36,908][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:03:37,566][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:03:38,225][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:03:38,882][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:03:39,542][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:03:40,200][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:03:40,858][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:03:41,517][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:03:42,176][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:03:42,834][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:03:43,492][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:03:44,150][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:03:44,807][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:03:45,465][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:03:46,123][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:03:46,782][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:03:47,440][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:03:48,098][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:03:48,757][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:03:49,415][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:03:50,073][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:03:50,732][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:03:51,390][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:03:52,047][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:03:52,705][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:03:53,364][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:03:54,023][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:03:54,681][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:03:55,338][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:03:55,996][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:03:56,654][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:03:57,311][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:03:57,969][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:03:58,627][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:03:59,285][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:03:59,943][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:04:00,601][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:04:01,259][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:04:01,920][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:04:02,924][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:04:03,584][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:04:04,245][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:04:04,904][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:04:05,564][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:04:06,222][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:04:06,881][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:04:07,541][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:04:08,200][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:04:08,859][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:04:09,518][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:04:10,178][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:04:10,836][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:04:11,495][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:04:12,152][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:04:12,810][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:04:13,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:04:14,229][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:04:15,551][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:04:15,554][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:04:15,555][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:04:16,973][__main__][INFO] - Iteration 652 took 52s (9.43% Gen, 87.85% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 30m 56s. Estimated total time: 14h 29m 8s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 54s, 500 more iterations: 7h 14m 34s. [2026-03-26 00:04:16,975][__main__][INFO] - Starting iteration 652. [2026-03-26 00:04:16,981][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:04:16,982][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:04:32,215][__main__][INFO] - Number of regex retries in iteration 652: 0 [2026-03-26 00:04:32,217][__main__][INFO] - agents played in iteration 652 are Bob, Alice [2026-03-26 00:04:33,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:04:33,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:04:33,364][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:04:33,365][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:04:34,070][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:04:34,678][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:04:35,337][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:04:35,997][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:04:36,654][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:04:37,312][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:04:37,974][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:04:38,632][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:04:39,290][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:04:39,948][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:04:40,606][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:04:41,264][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:04:41,922][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:04:42,580][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:04:43,239][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:04:43,897][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:04:44,554][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:04:45,216][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:04:45,874][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:04:46,541][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:04:47,200][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:04:47,859][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:04:48,517][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:04:49,175][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:04:49,835][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:04:50,493][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:04:51,151][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:04:51,808][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:04:52,466][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:04:53,123][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:04:53,780][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:04:54,438][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:04:55,095][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:04:55,753][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:04:56,411][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:04:57,069][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:04:57,727][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:04:58,385][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:04:59,045][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:04:59,703][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:05:00,360][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:05:01,018][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:05:01,676][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:05:02,334][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:05:02,993][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:05:03,651][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:05:04,308][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:05:04,966][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:05:05,959][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:05:06,618][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:05:07,279][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:05:07,936][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:05:08,594][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:05:09,253][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:05:09,913][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:05:10,572][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:05:11,231][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:05:11,891][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:05:12,548][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:05:13,207][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:05:13,866][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:05:14,523][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:05:15,181][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:05:15,839][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:05:16,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:05:17,301][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:05:18,663][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:05:18,665][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:05:18,666][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:05:20,229][__main__][INFO] - Iteration 653 took 1m 3s (24.09% Gen, 73.44% Train). Generation: 15s, Training: 46s. Estimated remaining time: 7h 34m 56s. Estimated total time: 17h 34m 11s. Time estimates for 10 more iterations: 10m 32s, 100 more iterations: 1h 45m 25s, 500 more iterations: 8h 47m 5s. [2026-03-26 00:05:20,231][__main__][INFO] - Starting iteration 653. [2026-03-26 00:05:20,236][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:05:20,236][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:05:28,286][__main__][INFO] - Number of regex retries in iteration 653: 0 [2026-03-26 00:05:28,287][__main__][INFO] - agents played in iteration 653 are Bob, Alice [2026-03-26 00:05:28,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:05:28,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:05:28,848][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:05:28,849][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:05:29,715][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:05:30,329][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:05:30,988][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:05:31,645][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:05:32,304][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:05:32,961][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:05:33,621][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:05:34,278][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:05:34,936][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:05:35,593][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:05:36,251][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:05:36,908][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:05:37,566][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:05:38,224][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:05:38,881][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:05:39,540][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:05:40,198][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:05:40,855][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:05:41,513][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:05:42,170][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:05:42,828][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:05:43,485][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:05:44,144][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:05:44,802][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:05:45,460][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:05:46,117][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:05:46,774][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:05:47,432][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:05:48,089][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:05:48,746][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:05:49,411][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:05:50,070][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:05:50,729][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:05:51,388][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:05:52,046][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:05:52,704][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:05:53,363][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:05:54,021][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:05:54,679][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:05:55,338][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:05:55,996][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:05:56,655][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:05:57,314][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:05:57,972][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:05:58,631][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:05:59,289][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:05:59,947][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:06:00,608][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:06:01,609][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:06:02,270][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:06:02,928][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:06:03,586][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:06:04,244][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:06:04,903][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:06:05,563][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:06:06,223][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:06:06,881][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:06:07,540][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:06:08,197][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:06:08,855][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:06:09,513][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:06:10,172][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:06:10,830][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:06:11,488][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:06:12,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:06:12,935][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:06:14,299][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:06:14,302][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:06:14,303][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:06:16,037][__main__][INFO] - Iteration 654 took 55s (14.43% Gen, 82.46% Train). Generation: 8s, Training: 46s. Estimated remaining time: 5h 29m 52s. Estimated total time: 15h 30m 3s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 0s, 500 more iterations: 7h 45m 1s. [2026-03-26 00:06:16,039][__main__][INFO] - Starting iteration 654. [2026-03-26 00:06:16,044][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:06:16,044][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:06:24,062][__main__][INFO] - Number of regex retries in iteration 654: 0 [2026-03-26 00:06:24,063][__main__][INFO] - agents played in iteration 654 are Bob, Alice [2026-03-26 00:06:24,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:06:24,729][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:06:24,730][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:06:24,730][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:06:25,613][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:06:26,219][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:06:26,878][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:06:27,536][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:06:28,195][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:06:28,853][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:06:29,510][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:06:30,171][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:06:30,829][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:06:31,488][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:06:32,145][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:06:32,802][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:06:33,463][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:06:34,122][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:06:34,782][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:06:35,439][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:06:36,099][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:06:36,757][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:06:37,416][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:06:38,074][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:06:38,732][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:06:39,391][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:06:40,051][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:06:40,709][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:06:41,366][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:06:42,024][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:06:42,683][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:06:43,341][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:06:43,999][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:06:44,657][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:06:45,315][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:06:45,973][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:06:46,632][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:06:47,290][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:06:47,948][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:06:48,606][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:06:49,265][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:06:49,923][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:06:50,581][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:06:51,239][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:06:51,896][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:06:52,554][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:06:53,212][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:06:53,870][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:06:54,527][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:06:55,185][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:06:55,843][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:06:56,501][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:06:57,492][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:06:58,154][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:06:58,815][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:06:59,473][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:07:00,131][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:07:00,791][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:07:01,450][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:07:02,110][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:07:02,768][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:07:03,428][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:07:04,088][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:07:04,747][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:07:05,406][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:07:06,064][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:07:06,723][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:07:07,382][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:07:08,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:07:08,857][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:07:10,181][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:07:10,183][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:07:10,185][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:07:11,693][__main__][INFO] - Iteration 655 took 55s (14.41% Gen, 82.87% Train). Generation: 8s, Training: 46s. Estimated remaining time: 5h 26m 25s. Estimated total time: 15h 27m 31s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 45s, 500 more iterations: 7h 43m 45s. [2026-03-26 00:07:11,695][__main__][INFO] - Starting iteration 655. [2026-03-26 00:07:11,700][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:07:11,701][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:07:18,831][__main__][INFO] - Number of regex retries in iteration 655: 0 [2026-03-26 00:07:18,832][__main__][INFO] - agents played in iteration 655 are Bob, Alice [2026-03-26 00:07:19,702][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:07:19,763][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:07:19,764][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:07:19,764][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:07:20,541][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:07:21,145][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:07:21,805][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:07:22,465][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:07:23,124][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:07:23,784][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:07:24,443][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:07:25,102][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:07:25,762][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:07:26,421][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:07:27,080][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:07:27,738][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:07:28,397][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:07:29,055][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:07:29,714][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:07:30,372][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:07:31,030][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:07:31,689][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:07:32,347][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:07:33,006][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:07:33,664][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:07:34,322][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:07:34,981][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:07:35,639][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:07:36,298][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:07:36,958][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:07:37,619][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:07:38,279][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:07:38,938][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:07:39,596][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:07:40,255][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:07:40,914][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:07:41,573][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:07:42,231][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:07:42,890][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:07:43,548][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:07:44,207][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:07:44,866][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:07:45,526][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:07:46,186][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:07:46,845][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:07:47,503][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:07:48,162][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:07:48,821][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:07:49,479][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:07:50,139][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:07:50,798][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:07:51,456][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:07:52,445][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:07:53,104][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:07:53,763][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:07:54,423][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:07:55,082][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:07:55,741][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:07:56,398][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:07:57,057][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:07:57,716][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:07:58,375][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:07:59,034][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:07:59,692][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:08:00,350][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:08:01,008][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:08:01,666][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:08:02,324][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:08:02,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:08:03,781][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:08:05,150][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:08:05,153][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:08:05,154][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:08:06,799][__main__][INFO] - Iteration 656 took 55s (12.94% Gen, 84.07% Train). Generation: 7s, Training: 46s. Estimated remaining time: 5h 16m 20s. Estimated total time: 15h 18m 22s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 50s, 500 more iterations: 7h 39m 11s. [2026-03-26 00:08:06,801][__main__][INFO] - Starting iteration 656. [2026-03-26 00:08:06,807][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:08:06,808][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:08:16,259][__main__][INFO] - Number of regex retries in iteration 656: 0 [2026-03-26 00:08:16,260][__main__][INFO] - agents played in iteration 656 are Bob, Alice [2026-03-26 00:08:16,865][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:08:16,926][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:08:16,927][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:08:16,927][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:08:17,718][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:08:18,337][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:08:18,997][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:08:19,655][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:08:20,316][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:08:20,975][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:08:21,632][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:08:22,292][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:08:22,951][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:08:23,615][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:08:24,274][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:08:24,934][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:08:25,592][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:08:26,259][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:08:26,917][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:08:27,578][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:08:28,237][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:08:28,896][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:08:29,556][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:08:30,215][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:08:30,876][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:08:31,534][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:08:32,192][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:08:32,850][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:08:33,507][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:08:34,166][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:08:34,824][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:08:35,482][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:08:36,141][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:08:36,799][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:08:37,457][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:08:38,115][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:08:38,774][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:08:39,432][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:08:40,091][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:08:40,751][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:08:41,409][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:08:42,067][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:08:42,725][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:08:43,383][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:08:44,040][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:08:44,698][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:08:45,355][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:08:46,013][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:08:46,671][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:08:47,329][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:08:47,987][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:08:48,646][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:08:49,627][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:08:50,286][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:08:50,946][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:08:51,604][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:08:52,261][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:08:52,920][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:08:53,578][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:08:54,236][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:08:54,894][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:08:55,552][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:08:56,210][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:08:56,868][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:08:57,527][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:08:58,184][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:08:58,843][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:08:59,501][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:09:00,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:09:00,918][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:09:02,249][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:09:02,251][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:09:02,253][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:09:03,719][__main__][INFO] - Iteration 657 took 56s (16.61% Gen, 80.81% Train). Generation: 9s, Training: 45s. Estimated remaining time: 5h 45m 37s. Estimated total time: 15h 48m 35s. Time estimates for 10 more iterations: 9m 29s, 100 more iterations: 1h 34m 51s, 500 more iterations: 7h 54m 17s. [2026-03-26 00:09:03,721][__main__][INFO] - Starting iteration 657. [2026-03-26 00:09:03,726][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:09:03,726][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:09:09,038][__main__][INFO] - Number of regex retries in iteration 657: 0 [2026-03-26 00:09:09,039][__main__][INFO] - agents played in iteration 657 are Bob, Alice [2026-03-26 00:09:09,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:09:09,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:09:09,596][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:09:09,596][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:09:10,378][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:09:10,992][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:09:11,652][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:09:12,314][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:09:12,974][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:09:13,633][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:09:14,293][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:09:14,953][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:09:15,612][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:09:16,270][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:09:16,931][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:09:17,590][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:09:18,250][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:09:18,909][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:09:19,569][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:09:20,227][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:09:20,888][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:09:21,548][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:09:22,208][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:09:22,867][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:09:23,527][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:09:24,187][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:09:24,848][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:09:25,508][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:09:26,168][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:09:26,826][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:09:27,486][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:09:28,147][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:09:28,807][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:09:29,465][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:09:30,124][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:09:30,783][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:09:31,442][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:09:32,102][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:09:32,761][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:09:33,422][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:09:34,081][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:09:34,739][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:09:35,398][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:09:36,056][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:09:36,715][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:09:37,374][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:09:38,033][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:09:38,693][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:09:39,353][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:09:40,012][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:09:40,671][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:09:41,330][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:09:42,321][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:09:42,981][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:09:43,640][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:09:44,298][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:09:44,956][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:09:45,614][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:09:46,273][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:09:46,931][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:09:47,590][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:09:48,248][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:09:48,907][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:09:49,565][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:09:50,223][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:09:50,882][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:09:51,539][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:09:52,197][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:09:52,856][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:09:53,649][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:09:54,976][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:09:54,979][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:09:54,980][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:09:56,377][__main__][INFO] - Iteration 658 took 52s (10.09% Gen, 87.25% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 33m 42s. Estimated total time: 14h 37m 33s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 45s, 500 more iterations: 7h 18m 46s. [2026-03-26 00:09:56,379][__main__][INFO] - Starting iteration 658. [2026-03-26 00:09:56,384][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:09:56,385][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:10:02,853][__main__][INFO] - Number of regex retries in iteration 658: 0 [2026-03-26 00:10:02,854][__main__][INFO] - agents played in iteration 658 are Bob, Alice [2026-03-26 00:10:03,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:10:03,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:10:03,517][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:10:03,518][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:10:04,247][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:10:04,858][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:10:05,518][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:10:06,178][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:10:06,836][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:10:07,495][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:10:08,153][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:10:08,810][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:10:09,469][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:10:10,126][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:10:10,786][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:10:11,443][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:10:12,101][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:10:12,760][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:10:13,418][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:10:14,075][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:10:14,733][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:10:15,392][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:10:16,050][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:10:16,708][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:10:17,366][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:10:18,024][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:10:18,683][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:10:19,342][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:10:20,000][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:10:20,658][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:10:21,316][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:10:21,974][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:10:22,632][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:10:23,291][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:10:23,949][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:10:24,607][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:10:25,265][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:10:25,923][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:10:26,582][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:10:27,240][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:10:27,897][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:10:28,555][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:10:29,214][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:10:29,871][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:10:30,529][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:10:31,186][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:10:31,844][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:10:32,502][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:10:33,160][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:10:33,818][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:10:34,477][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:10:35,135][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:10:36,115][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:10:36,774][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:10:37,434][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:10:38,093][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:10:38,752][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:10:39,411][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:10:40,070][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:10:40,729][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:10:41,387][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:10:42,046][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:10:42,704][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:10:43,363][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:10:44,021][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:10:44,680][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:10:45,339][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:10:45,998][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:10:46,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:10:47,516][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:10:48,846][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:10:48,848][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:10:48,850][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:10:50,266][__main__][INFO] - Iteration 659 took 53s (12.01% Gen, 85.36% Train). Generation: 6s, Training: 45s. Estimated remaining time: 4h 53m 19s. Estimated total time: 14h 58m 4s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 48s, 500 more iterations: 7h 29m 2s. [2026-03-26 00:10:50,268][__main__][INFO] - Starting iteration 659. [2026-03-26 00:10:50,273][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:10:50,274][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:10:55,523][__main__][INFO] - Number of regex retries in iteration 659: 0 [2026-03-26 00:10:55,524][__main__][INFO] - agents played in iteration 659 are Bob, Alice [2026-03-26 00:10:56,043][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:10:56,105][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:10:56,106][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:10:56,106][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:10:56,863][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:10:57,474][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:10:58,134][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:10:58,792][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:10:59,451][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:11:00,109][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:11:00,767][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:11:01,425][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:11:02,083][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:11:02,740][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:11:03,400][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:11:04,057][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:11:04,715][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:11:05,373][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:11:06,034][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:11:06,691][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:11:07,349][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:11:08,006][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:11:08,664][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:11:09,321][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:11:09,979][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:11:10,638][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:11:11,295][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:11:11,954][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:11:12,612][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:11:13,270][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:11:13,927][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:11:14,585][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:11:15,242][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:11:15,900][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:11:16,558][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:11:17,216][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:11:17,873][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:11:18,531][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:11:19,189][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:11:19,847][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:11:20,505][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:11:21,164][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:11:21,822][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:11:22,480][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:11:23,139][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:11:23,796][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:11:24,454][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:11:25,113][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:11:25,771][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:11:26,429][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:11:27,087][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:11:27,745][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:11:28,792][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:11:29,451][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:11:30,109][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:11:30,767][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:11:31,425][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:11:32,086][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:11:32,745][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:11:33,408][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:11:34,064][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:11:34,723][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:11:35,381][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:11:36,039][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:11:36,696][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:11:38,265][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:11:41,226][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:11:41,884][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:11:42,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:11:43,335][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:46 [2026-03-26 00:11:44,661][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:11:44,663][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:11:44,664][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:11:46,420][__main__][INFO] - Iteration 660 took 56s (9.35% Gen, 87.52% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 30m 8s. Estimated total time: 15h 35m 49s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 34s, 500 more iterations: 7h 47m 54s. [2026-03-26 00:11:46,422][__main__][INFO] - Starting iteration 660. [2026-03-26 00:11:46,427][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:11:46,428][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:11:51,949][__main__][INFO] - Number of regex retries in iteration 660: 0 [2026-03-26 00:11:51,950][__main__][INFO] - agents played in iteration 660 are Bob, Alice [2026-03-26 00:11:52,812][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:11:52,872][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:11:52,873][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:11:52,873][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:11:53,747][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:11:54,357][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:11:55,017][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:11:55,678][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:11:56,340][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:11:56,999][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:11:57,659][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:11:58,319][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:11:58,977][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:11:59,637][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:12:00,295][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:12:00,953][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:12:01,611][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:12:02,270][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:12:02,928][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:12:03,587][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:12:04,246][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:12:04,904][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:12:05,564][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:12:06,222][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:12:06,880][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:12:07,538][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:12:08,197][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:12:08,856][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:12:09,514][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:12:10,173][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:12:10,831][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:12:11,490][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:12:12,149][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:12:12,808][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:12:13,466][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:12:14,124][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:12:14,783][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:12:15,441][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:12:16,100][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:12:16,759][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:12:17,418][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:12:18,077][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:12:18,735][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:12:19,394][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:12:20,054][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:12:20,713][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:12:21,371][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:12:22,031][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:12:22,690][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:12:23,348][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:12:24,008][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:12:24,666][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:12:25,706][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:12:26,365][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:12:27,024][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:12:27,683][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:12:28,342][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:12:29,000][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:12:29,660][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:12:30,319][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:12:30,977][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:12:31,634][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:12:32,293][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:12:32,952][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:12:33,611][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:12:34,270][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:12:34,928][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:12:35,586][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:12:36,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:12:37,034][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:12:38,362][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:12:38,365][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:12:38,366][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:12:39,864][__main__][INFO] - Iteration 661 took 53s (10.33% Gen, 86.86% Train). Generation: 5s, Training: 46s. Estimated remaining time: 4h 44m 5s. Estimated total time: 14h 50m 39s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 3s, 500 more iterations: 7h 25m 19s. [2026-03-26 00:12:39,867][__main__][INFO] - Starting iteration 661. [2026-03-26 00:12:39,870][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:12:39,871][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:12:45,218][__main__][INFO] - Number of regex retries in iteration 661: 0 [2026-03-26 00:12:45,219][__main__][INFO] - agents played in iteration 661 are Bob, Alice [2026-03-26 00:12:45,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:12:45,875][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:12:45,876][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:12:45,876][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:12:46,747][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:12:47,367][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:12:48,029][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:12:48,687][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:12:49,348][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:12:50,008][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:12:50,667][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:12:51,326][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:12:51,986][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:12:52,645][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:12:53,306][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:12:53,964][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:12:54,622][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:12:55,281][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:12:55,941][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:12:56,599][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:12:57,258][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:12:57,917][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:12:58,580][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:12:59,239][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:12:59,897][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:13:00,555][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:13:01,214][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:13:01,873][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:13:02,531][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:13:03,190][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:13:03,848][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:13:04,507][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:13:05,166][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:13:05,826][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:13:06,486][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:13:07,145][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:13:07,803][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:13:08,461][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:13:09,120][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:13:09,779][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:13:10,437][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:13:11,097][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:13:11,756][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:13:12,416][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:13:13,075][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:13:13,735][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:13:14,393][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:13:15,052][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:13:15,713][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:13:16,370][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:13:17,028][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:13:17,686][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:13:18,690][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:13:19,349][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:13:20,008][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:13:20,667][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:13:21,325][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:13:21,983][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:13:22,641][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:13:23,300][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:13:23,958][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:13:24,616][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:13:25,274][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:13:25,932][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:13:26,590][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:13:27,742][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:13:28,398][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:13:29,056][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:13:29,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:13:30,566][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:13:31,924][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:13:31,926][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:13:31,928][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:13:33,453][__main__][INFO] - Iteration 662 took 53s (9.98% Gen, 87.17% Train). Generation: 5s, Training: 46s. Estimated remaining time: 4h 45m 36s. Estimated total time: 14h 53m 4s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 18s, 500 more iterations: 7h 26m 32s. [2026-03-26 00:13:33,455][__main__][INFO] - Starting iteration 662. [2026-03-26 00:13:33,460][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:13:33,461][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:13:38,843][__main__][INFO] - Number of regex retries in iteration 662: 0 [2026-03-26 00:13:38,845][__main__][INFO] - agents played in iteration 662 are Bob, Alice [2026-03-26 00:13:39,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:13:39,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:13:39,401][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:13:39,401][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:13:40,121][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:13:40,733][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:13:41,392][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:13:42,052][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:13:42,711][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:13:43,369][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:13:44,027][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:13:44,686][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:13:45,345][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:13:46,002][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:13:46,660][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:13:47,319][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:13:47,978][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:13:48,635][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:13:49,292][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:13:49,953][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:13:50,608][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:13:51,267][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:13:51,925][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:13:52,583][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:13:53,242][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:13:53,899][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:13:54,557][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:13:55,216][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:13:55,874][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:13:56,532][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:13:57,193][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:13:57,852][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:13:58,518][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:13:59,179][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:13:59,836][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:14:00,494][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:14:01,151][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:14:01,810][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:14:02,468][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:14:03,126][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:14:03,783][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:14:04,442][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:14:05,100][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:14:05,758][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:14:06,417][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:14:07,074][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:14:07,733][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:14:08,390][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:14:09,048][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:14:09,706][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:14:10,364][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:14:11,021][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:14:12,014][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:14:12,672][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:14:13,330][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:14:13,987][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:14:14,645][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:14:15,304][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:14:15,962][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:14:16,622][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:14:17,279][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:14:17,939][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:14:18,597][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:14:19,254][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:14:19,912][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:14:20,569][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:14:21,227][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:14:21,885][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:14:22,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:14:23,310][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:14:24,639][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:14:24,642][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:14:24,643][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:14:25,963][__main__][INFO] - Iteration 663 took 52s (10.25% Gen, 87.22% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 26m 45s. Estimated total time: 14h 35m 6s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 30s, 500 more iterations: 7h 17m 33s. [2026-03-26 00:14:25,966][__main__][INFO] - Starting iteration 663. [2026-03-26 00:14:25,970][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:14:25,970][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:14:30,949][__main__][INFO] - Number of regex retries in iteration 663: 0 [2026-03-26 00:14:30,951][__main__][INFO] - agents played in iteration 663 are Bob, Alice [2026-03-26 00:14:31,484][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:14:31,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:14:31,546][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:14:31,546][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:14:32,249][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:14:32,863][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:14:33,523][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:14:34,185][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:14:34,845][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:14:35,504][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:14:36,163][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:14:36,821][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:14:37,479][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:14:38,138][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:14:38,796][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:14:39,453][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:14:40,112][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:14:40,770][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:14:41,431][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:14:42,089][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:14:42,746][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:14:43,404][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:14:44,063][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:14:44,722][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:14:45,380][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:14:46,037][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:14:46,696][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:14:47,354][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:14:48,011][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:14:48,668][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:14:49,327][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:14:49,985][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:14:50,643][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:14:51,301][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:14:51,958][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:14:52,617][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:14:53,275][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:14:53,933][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:14:54,590][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:14:55,248][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:14:55,906][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:14:56,563][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:14:57,221][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:14:57,879][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:14:58,537][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:14:59,195][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:14:59,854][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:15:00,511][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:15:01,169][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:15:01,827][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:15:02,485][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:15:03,143][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:15:04,131][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:15:04,791][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:15:05,449][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:15:06,107][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:15:06,765][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:15:07,423][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:15:08,081][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:15:08,739][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:15:09,399][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:15:10,056][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:15:10,716][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:15:11,374][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:15:12,033][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:15:12,691][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:15:13,350][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:15:14,009][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:15:14,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:15:15,417][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:15:16,799][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:15:16,802][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:15:16,803][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:15:18,196][__main__][INFO] - Iteration 664 took 52s (9.54% Gen, 87.79% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 21m 15s. Estimated total time: 14h 30m 28s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 2s, 500 more iterations: 7h 15m 14s. [2026-03-26 00:15:18,198][__main__][INFO] - Starting iteration 664. [2026-03-26 00:15:18,203][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:15:18,204][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:15:24,527][__main__][INFO] - Number of regex retries in iteration 664: 0 [2026-03-26 00:15:24,528][__main__][INFO] - agents played in iteration 664 are Bob, Alice [2026-03-26 00:15:25,631][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:15:25,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:15:25,699][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:15:25,700][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:15:26,501][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:15:27,106][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:15:27,766][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:15:28,423][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:15:29,082][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:15:29,740][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:15:30,398][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:15:31,055][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:15:31,714][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:15:32,373][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:15:33,031][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:15:33,688][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:15:34,346][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:15:35,007][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:15:35,666][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:15:36,325][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:15:36,984][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:15:37,647][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:15:38,300][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:15:38,960][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:15:40,214][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:15:40,871][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:15:41,529][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:15:42,187][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:15:42,846][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:15:43,506][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:15:44,164][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:15:44,822][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:15:45,481][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:15:46,139][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:15:46,796][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:15:47,454][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:15:48,113][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:15:48,771][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:15:49,428][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:15:50,088][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:15:50,746][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:15:51,404][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:15:52,062][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:15:52,719][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:15:53,377][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:15:54,035][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:15:54,694][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:15:55,352][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:15:56,010][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:15:56,668][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:15:57,326][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:15:57,984][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:15:58,970][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:15:59,630][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:16:00,287][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:16:00,945][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:16:01,603][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:16:02,261][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:16:02,920][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:16:03,579][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:16:04,238][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:16:04,896][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:16:05,554][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:16:06,211][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:16:06,869][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:16:07,527][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:16:08,184][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:16:08,842][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:16:09,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:16:10,300][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:16:11,625][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:16:11,628][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:16:11,629][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:16:13,012][__main__][INFO] - Iteration 665 took 54s (11.54% Gen, 85.93% Train). Generation: 6s, Training: 47s. Estimated remaining time: 5h 3m 23s. Estimated total time: 15h 13m 31s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 21s, 500 more iterations: 7h 36m 45s. [2026-03-26 00:16:13,014][__main__][INFO] - Starting iteration 665. [2026-03-26 00:16:13,020][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:16:13,020][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:16:18,809][__main__][INFO] - Number of regex retries in iteration 665: 0 [2026-03-26 00:16:18,811][__main__][INFO] - agents played in iteration 665 are Bob, Alice [2026-03-26 00:16:19,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:16:19,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:16:19,389][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:16:19,390][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:16:20,133][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:16:20,747][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:16:21,406][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:16:22,064][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:16:22,722][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:16:23,380][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:16:24,039][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:16:24,697][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:16:25,355][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:16:26,016][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:16:26,673][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:16:27,333][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:16:27,993][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:16:28,652][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:16:29,310][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:16:29,969][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:16:30,629][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:16:31,287][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:16:31,948][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:16:32,608][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:16:33,267][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:16:33,927][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:16:34,584][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:16:35,243][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:16:35,902][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:16:36,561][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:16:37,219][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:16:37,880][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:16:38,540][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:16:39,198][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:16:39,858][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:16:40,517][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:16:41,179][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:16:41,839][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:16:42,498][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:16:43,158][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:16:43,818][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:16:44,476][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:16:45,136][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:16:45,795][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:16:46,453][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:16:47,112][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:16:47,771][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:16:48,430][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:16:49,089][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:16:49,747][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:16:50,406][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:16:51,065][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:16:52,052][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:16:52,711][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:16:53,370][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:16:54,027][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:16:54,687][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:16:55,345][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:16:56,004][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:16:56,663][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:16:58,329][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:16:58,988][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:16:59,645][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:17:00,302][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:17:00,959][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:17:01,617][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:17:02,275][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:17:02,933][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:17:03,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:17:04,386][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-26 00:17:05,762][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:17:05,765][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:17:05,766][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:17:07,268][__main__][INFO] - Iteration 666 took 54s (10.67% Gen, 86.55% Train). Generation: 5s, Training: 46s. Estimated remaining time: 4h 53m 9s. Estimated total time: 15h 4m 11s. Time estimates for 10 more iterations: 9m 2s, 100 more iterations: 1h 30m 25s, 500 more iterations: 7h 32m 5s. [2026-03-26 00:17:07,271][__main__][INFO] - Starting iteration 666. [2026-03-26 00:17:07,275][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:17:07,276][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:17:12,239][__main__][INFO] - Number of regex retries in iteration 666: 0 [2026-03-26 00:17:12,240][__main__][INFO] - agents played in iteration 666 are Bob, Alice [2026-03-26 00:17:12,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:17:12,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:17:12,797][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:17:12,797][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:17:13,752][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:17:14,355][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:17:15,014][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:17:15,673][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:17:16,331][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:17:16,988][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:17:17,649][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:17:18,305][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:17:18,963][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:17:19,623][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:17:20,280][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:17:20,939][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:17:21,597][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:17:22,256][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:17:22,913][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:17:23,571][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:17:24,228][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:17:24,887][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:17:25,544][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:17:26,203][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:17:26,861][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:17:27,519][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:17:28,175][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:17:28,834][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:17:29,494][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:17:30,151][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:17:30,809][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:17:31,467][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:17:32,125][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:17:32,782][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:17:33,441][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:17:34,099][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:17:34,757][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:17:35,415][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:17:36,073][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:17:36,731][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:17:37,389][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:17:38,047][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:17:38,705][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:17:39,363][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:17:40,021][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:17:40,680][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:17:41,338][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:17:41,996][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:17:42,655][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:17:43,313][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:17:43,970][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:17:44,628][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:17:45,625][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:17:46,284][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:17:46,943][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:17:47,601][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:17:48,259][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:17:48,916][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:17:49,574][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:17:50,233][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:17:50,891][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:17:51,549][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:17:52,206][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:17:52,865][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:17:53,523][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:17:54,181][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:17:54,839][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:17:55,497][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:17:56,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:17:56,982][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:17:58,305][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:17:58,308][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:17:58,309][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:18:00,087][__main__][INFO] - Iteration 667 took 52s (9.40% Gen, 87.23% Train). Generation: 4s, Training: 46s. Estimated remaining time: 4h 28m 19s. Estimated total time: 14h 40m 14s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 1s, 500 more iterations: 7h 20m 7s. [2026-03-26 00:18:00,090][__main__][INFO] - Starting iteration 667. [2026-03-26 00:18:00,095][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:18:00,095][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:18:06,820][__main__][INFO] - Number of regex retries in iteration 667: 0 [2026-03-26 00:18:06,821][__main__][INFO] - agents played in iteration 667 are Bob, Alice [2026-03-26 00:18:07,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:18:07,536][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:18:07,536][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:18:07,537][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:18:08,369][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:18:08,981][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:18:09,640][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:18:10,297][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:18:10,957][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:18:11,615][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:18:12,273][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:18:12,932][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:18:13,591][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:18:14,248][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:18:14,907][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:18:15,564][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:18:16,223][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:18:16,884][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:18:17,542][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:18:18,199][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:18:18,857][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:18:19,514][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:18:20,174][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:18:20,831][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:18:21,489][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:18:22,147][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:18:22,805][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:18:23,464][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:18:24,123][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:18:24,785][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:18:25,444][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:18:26,104][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:18:26,763][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:18:27,422][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:18:28,082][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:18:28,742][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:18:29,400][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:18:30,058][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:18:30,716][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:18:31,374][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:18:32,031][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:18:32,689][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:18:33,347][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:18:34,005][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:18:34,663][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:18:35,323][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:18:35,980][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:18:36,639][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:18:37,298][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:18:37,955][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:18:38,613][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:18:39,272][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:18:40,266][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:18:40,925][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:18:41,583][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:18:42,241][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:18:42,899][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:18:43,560][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:18:44,218][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:18:44,876][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:18:45,534][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:18:46,193][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:18:46,852][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:18:47,510][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:18:48,169][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:18:48,828][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:18:49,486][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:18:50,143][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:18:50,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:18:51,653][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:18:53,107][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:18:53,110][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:18:53,111][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:18:56,830][__main__][INFO] - Iteration 668 took 56s (11.85% Gen, 81.59% Train). Generation: 6s, Training: 46s. Estimated remaining time: 5h 32m 46s. Estimated total time: 15h 45m 37s. Time estimates for 10 more iterations: 9m 27s, 100 more iterations: 1h 34m 33s, 500 more iterations: 7h 52m 48s. [2026-03-26 00:18:56,832][__main__][INFO] - Starting iteration 668. [2026-03-26 00:18:56,836][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:18:56,837][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:19:02,177][__main__][INFO] - Number of regex retries in iteration 668: 0 [2026-03-26 00:19:02,178][__main__][INFO] - agents played in iteration 668 are Bob, Alice [2026-03-26 00:19:03,082][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:19:03,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:19:03,144][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:19:03,144][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:19:03,982][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:19:04,622][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:19:05,249][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:19:05,906][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:19:06,565][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:19:07,222][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:19:07,880][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:19:08,539][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:19:09,197][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:19:09,854][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:19:10,512][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:19:11,169][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:19:11,827][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:19:12,486][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:19:13,143][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:19:13,800][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:19:14,459][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:19:15,116][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:19:15,773][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:19:16,430][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:19:17,088][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:19:17,745][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:19:18,405][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:19:19,064][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:19:19,723][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:19:20,383][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:19:21,045][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:19:21,700][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:19:22,359][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:19:23,017][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:19:23,676][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:19:24,334][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:19:24,992][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:19:25,649][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:19:26,307][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:19:26,964][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:19:27,621][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:19:28,279][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:19:28,938][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:19:29,595][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:19:30,254][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:19:30,912][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:19:31,569][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:19:32,227][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:19:32,884][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:19:33,543][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:19:34,202][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:19:34,860][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:19:35,848][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:19:36,508][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:19:37,166][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:19:37,826][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:19:38,483][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:19:39,143][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:19:39,801][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:19:40,461][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:19:41,118][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:19:41,777][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:19:42,437][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:19:43,094][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:19:43,752][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:19:44,411][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:19:45,070][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:19:45,728][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:19:46,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:19:47,106][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:19:48,437][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:19:48,440][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:19:48,441][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:19:49,917][__main__][INFO] - Iteration 669 took 53s (10.06% Gen, 87.15% Train). Generation: 5s, Training: 46s. Estimated remaining time: 4h 30m 58s. Estimated total time: 14h 44m 42s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 28s, 500 more iterations: 7h 22m 21s. [2026-03-26 00:19:49,919][__main__][INFO] - Starting iteration 669. [2026-03-26 00:19:49,928][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:19:49,929][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:19:55,850][__main__][INFO] - Number of regex retries in iteration 669: 0 [2026-03-26 00:19:55,852][__main__][INFO] - agents played in iteration 669 are Bob, Alice [2026-03-26 00:19:56,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:19:57,027][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:19:57,028][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:19:57,028][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:19:57,805][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:19:58,422][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:19:59,083][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:19:59,741][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:20:00,400][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:20:01,058][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:20:01,715][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:20:02,374][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:20:03,031][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:20:03,689][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:20:04,346][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:20:05,003][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:20:05,661][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:20:06,318][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:20:06,976][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:20:07,635][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:20:08,293][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:20:08,950][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:20:09,608][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:20:10,265][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:20:10,922][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:20:11,580][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:20:12,238][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:20:12,896][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:20:13,555][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:20:14,213][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:20:14,871][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:20:15,528][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:20:16,186][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:20:16,843][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:20:17,501][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:20:18,158][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:20:18,816][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:20:19,473][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:20:20,131][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:20:20,788][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:20:21,445][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:20:22,103][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:20:22,761][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:20:23,420][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:20:24,078][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:20:24,736][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:20:25,393][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:20:26,052][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:20:26,711][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:20:27,369][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:20:28,026][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:20:28,685][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:20:29,680][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:20:30,341][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:20:31,000][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:20:31,659][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:20:32,317][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:20:32,977][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:20:33,636][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:20:34,293][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:20:34,953][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:20:35,611][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:20:36,269][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:20:36,927][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:20:37,587][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:20:38,244][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:20:38,902][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:20:39,562][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:20:40,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:20:40,914][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:20:42,239][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:20:42,242][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:20:42,243][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:20:43,775][__main__][INFO] - Iteration 670 took 53s (11.00% Gen, 86.14% Train). Generation: 5s, Training: 46s. Estimated remaining time: 4h 42m 54s. Estimated total time: 14h 57m 32s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 45s, 500 more iterations: 7h 28m 46s. [2026-03-26 00:20:43,777][__main__][INFO] - Starting iteration 670. [2026-03-26 00:20:43,781][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:20:43,782][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:20:51,020][__main__][INFO] - Number of regex retries in iteration 670: 0 [2026-03-26 00:20:51,021][__main__][INFO] - agents played in iteration 670 are Bob, Alice [2026-03-26 00:20:51,636][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:20:51,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:20:51,700][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:20:51,701][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:20:52,499][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:20:53,121][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:20:53,780][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:20:54,438][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:20:55,095][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:20:55,754][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:20:56,412][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:20:57,069][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:20:57,730][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:20:58,387][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:20:59,045][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:20:59,702][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:21:00,360][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:21:01,018][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:21:01,675][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:21:02,333][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:21:02,991][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:21:03,648][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:21:04,317][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:21:04,980][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:21:05,634][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:21:06,294][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:21:06,951][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:21:07,610][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:21:08,268][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:21:08,926][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:21:09,585][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:21:10,243][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:21:10,900][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:21:11,558][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:21:12,216][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:21:12,874][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:21:13,531][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:21:14,190][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:21:14,848][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:21:15,507][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:21:16,165][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:21:16,823][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:21:17,481][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:21:18,139][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:21:18,796][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:21:19,455][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:21:20,113][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:21:20,771][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:21:21,431][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:21:22,092][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:21:22,751][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:21:23,410][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:21:24,425][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:21:25,963][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:21:26,618][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:21:27,275][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:21:27,934][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:21:28,592][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:21:29,250][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:21:29,909][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:21:30,567][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:21:31,225][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:21:31,884][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:21:32,542][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:21:33,199][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:21:33,857][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:21:34,516][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:21:35,174][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:21:35,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:21:36,702][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-26 00:21:38,065][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:21:38,068][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:21:38,069][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:21:42,332][__main__][INFO] - Iteration 671 took 58s (12.36% Gen, 80.35% Train). Generation: 7s, Training: 47s. Estimated remaining time: 6h 0m 16s. Estimated total time: 16h 15m 53s. Time estimates for 10 more iterations: 9m 45s, 100 more iterations: 1h 37m 35s, 500 more iterations: 8h 7m 56s. [2026-03-26 00:21:42,342][__main__][INFO] - Starting iteration 671. [2026-03-26 00:21:42,384][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:21:42,384][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:21:48,419][__main__][INFO] - Number of regex retries in iteration 671: 0 [2026-03-26 00:21:48,420][__main__][INFO] - agents played in iteration 671 are Bob, Alice [2026-03-26 00:21:48,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:21:49,004][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:21:49,004][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:21:49,005][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:21:49,844][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:21:50,463][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:21:51,123][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:21:51,782][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:21:52,443][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:21:53,103][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:21:53,762][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:21:54,422][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:21:55,081][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:21:55,741][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:21:56,400][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:21:57,058][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:21:57,717][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:21:58,377][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:21:59,036][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:21:59,695][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:22:00,354][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:22:01,013][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:22:01,672][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:22:02,331][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:22:02,991][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:22:03,650][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:22:04,308][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:22:04,967][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:22:05,627][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:22:06,286][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:22:06,944][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:22:07,603][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:22:08,263][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:22:08,921][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:22:09,579][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:22:10,239][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:22:10,898][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:22:11,557][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:22:12,215][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:22:12,875][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:22:13,534][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:22:14,193][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:22:14,856][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:22:15,514][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:22:16,173][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:22:16,831][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:22:17,490][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:22:18,148][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:22:18,809][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:22:19,469][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:22:20,128][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:22:20,787][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:22:21,783][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:22:22,446][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:22:23,101][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:22:23,759][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:22:24,417][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:22:25,076][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:22:25,734][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:22:26,392][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:22:27,052][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:22:27,710][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:22:28,369][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:22:29,027][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:22:29,690][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:22:30,346][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:22:31,006][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:22:31,665][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:22:32,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:22:33,119][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:22:34,469][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:22:34,472][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:22:34,473][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:22:35,952][__main__][INFO] - Iteration 672 took 53s (11.27% Gen, 85.97% Train). Generation: 6s, Training: 46s. Estimated remaining time: 4h 36m 19s. Estimated total time: 14h 52m 49s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 16s, 500 more iterations: 7h 26m 24s. [2026-03-26 00:22:35,954][__main__][INFO] - Starting iteration 672. [2026-03-26 00:22:35,959][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:22:35,959][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:22:40,764][__main__][INFO] - Number of regex retries in iteration 672: 0 [2026-03-26 00:22:40,765][__main__][INFO] - agents played in iteration 672 are Bob, Alice [2026-03-26 00:22:41,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:22:41,896][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:22:41,897][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:22:41,897][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:22:42,744][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:22:43,354][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:22:44,014][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:22:44,672][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:22:45,330][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:22:45,987][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:22:46,647][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:22:47,305][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:22:47,963][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:22:48,623][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:22:49,281][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:22:49,939][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:22:50,599][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:22:51,256][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:22:51,918][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:22:52,580][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:22:53,239][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:22:53,898][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:22:54,556][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:22:55,214][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:22:55,872][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:22:56,531][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:22:57,188][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:22:57,847][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:22:58,505][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:22:59,163][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:22:59,822][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:23:00,483][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:23:01,151][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:23:01,811][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:23:02,470][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:23:03,132][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:23:03,789][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:23:04,448][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:23:05,107][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:23:05,765][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:23:06,423][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:23:07,081][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:23:07,739][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:23:08,398][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:23:09,055][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:23:09,714][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:23:10,372][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:23:11,031][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:23:11,688][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:23:12,346][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:23:13,004][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:23:13,661][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:23:14,654][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:23:15,314][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:23:15,973][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:23:16,631][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:23:17,289][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:23:17,947][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:23:18,605][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:23:19,265][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:23:19,922][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:23:20,581][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:23:21,239][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:23:21,896][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:23:22,554][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:23:23,211][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:23:23,869][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:23:24,527][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:23:25,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:23:25,970][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:23:27,315][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:23:27,318][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:23:27,319][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:23:29,121][__main__][INFO] - Iteration 673 took 53s (9.04% Gen, 87.57% Train). Generation: 4s, Training: 46s. Estimated remaining time: 4h 28m 40s. Estimated total time: 14h 46m 4s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 36s, 500 more iterations: 7h 23m 2s. [2026-03-26 00:23:29,123][__main__][INFO] - Starting iteration 673. [2026-03-26 00:23:29,127][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:23:29,128][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:23:43,691][__main__][INFO] - Number of regex retries in iteration 673: 0 [2026-03-26 00:23:43,693][__main__][INFO] - agents played in iteration 673 are Bob, Alice [2026-03-26 00:23:44,323][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:23:44,386][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:23:44,387][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:23:44,388][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:23:45,235][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:23:45,923][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:23:46,548][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:23:47,206][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:23:47,863][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:23:48,521][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:23:49,180][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:23:49,839][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:23:50,496][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:23:51,154][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:23:51,812][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:23:52,469][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:23:53,126][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:23:53,785][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:23:54,443][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:23:55,101][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:23:55,759][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:23:56,416][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:23:57,075][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:23:57,732][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:23:58,390][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:23:59,048][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:23:59,706][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:24:00,363][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:24:01,021][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:24:01,679][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:24:02,337][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:24:02,994][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:24:03,651][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:24:04,309][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:24:04,966][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:24:05,624][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:24:06,282][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:24:06,941][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:24:07,599][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:24:08,257][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:24:08,914][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:24:09,572][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:24:10,231][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:24:10,888][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:24:11,546][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:24:12,203][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:24:12,861][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:24:13,518][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:24:14,176][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:24:14,835][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:24:15,494][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:24:16,152][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:24:17,143][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:24:17,801][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:24:18,459][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:24:19,116][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:24:19,774][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:24:20,431][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:24:21,089][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:24:21,747][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:24:22,405][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:24:23,062][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:24:23,719][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:24:24,378][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:24:25,036][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:24:25,694][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:24:26,352][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:24:27,011][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:24:27,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:24:28,515][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:24:29,915][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:24:29,918][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:24:29,919][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:24:31,283][__main__][INFO] - Iteration 674 took 1m 2s (23.43% Gen, 74.37% Train). Generation: 14s, Training: 46s. Estimated remaining time: 6h 57m 31s. Estimated total time: 17h 15m 57s. Time estimates for 10 more iterations: 10m 21s, 100 more iterations: 1h 43m 35s, 500 more iterations: 8h 37m 58s. [2026-03-26 00:24:31,285][__main__][INFO] - Starting iteration 674. [2026-03-26 00:24:31,289][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:24:31,290][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:24:37,076][__main__][INFO] - Number of regex retries in iteration 674: 0 [2026-03-26 00:24:37,077][__main__][INFO] - agents played in iteration 674 are Bob, Alice [2026-03-26 00:24:37,570][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:24:37,631][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:24:37,632][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:24:37,632][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:24:38,471][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:24:39,087][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:24:39,748][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:24:40,407][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:24:41,064][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:24:41,723][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:24:42,381][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:24:43,039][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:24:43,698][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:24:44,357][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:24:45,014][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:24:45,672][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:24:46,330][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:24:46,988][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:24:47,646][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:24:48,304][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:24:49,707][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:24:50,362][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:24:51,020][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:24:51,677][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:24:52,336][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:24:52,994][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:24:53,651][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:24:54,309][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:24:54,970][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:24:55,628][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:24:56,286][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:24:56,944][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:24:57,602][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:24:58,262][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:24:58,921][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:24:59,578][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:25:00,237][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:25:00,895][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:25:01,553][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:25:02,212][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:25:02,870][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:25:03,527][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:25:04,187][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:25:04,846][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:25:05,504][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:25:06,162][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:25:06,820][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:25:07,478][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:25:08,136][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:25:08,794][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:25:09,453][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:25:10,110][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:25:11,101][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:25:11,760][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:25:12,417][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:25:13,078][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:25:13,735][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:25:14,393][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:25:15,052][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:25:15,709][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:25:16,367][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:25:17,025][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:25:17,682][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:25:18,340][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:25:18,997][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:25:19,656][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:25:20,314][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:25:20,972][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:25:21,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:25:22,414][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:25:23,742][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:25:23,744][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:25:23,745][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:25:25,278][__main__][INFO] - Iteration 675 took 53s (10.72% Gen, 86.44% Train). Generation: 5s, Training: 46s. Estimated remaining time: 4h 40m 30s. Estimated total time: 14h 59m 50s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 59s, 500 more iterations: 7h 29m 55s. [2026-03-26 00:25:25,280][__main__][INFO] - Starting iteration 675. [2026-03-26 00:25:25,284][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:25:25,314][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:25:30,311][__main__][INFO] - Number of regex retries in iteration 675: 0 [2026-03-26 00:25:30,312][__main__][INFO] - agents played in iteration 675 are Bob, Alice [2026-03-26 00:25:30,841][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:25:30,903][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:25:30,903][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:25:30,904][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:25:31,665][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:25:32,283][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:25:32,942][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:25:33,601][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:25:34,260][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:25:34,918][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:25:35,576][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:25:36,233][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:25:36,893][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:25:37,553][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:25:38,212][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:25:38,871][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:25:39,529][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:25:40,187][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:25:40,845][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:25:41,503][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:25:42,160][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:25:42,820][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:25:43,478][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:25:44,136][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:25:44,795][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:25:45,454][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:25:46,111][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:25:46,769][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:25:47,427][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:25:48,085][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:25:48,743][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:25:49,401][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:25:50,058][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:25:50,717][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:25:51,375][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:25:52,033][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:25:52,691][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:25:53,350][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:25:54,009][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:25:54,668][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:25:55,325][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:25:55,983][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:25:56,642][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:25:57,299][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:25:57,957][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:25:58,615][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:25:59,273][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:25:59,931][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:26:00,589][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:26:01,247][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:26:01,904][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:26:02,562][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:26:03,555][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:26:04,215][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:26:04,872][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:26:05,530][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:26:06,882][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:26:07,541][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:26:08,199][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:26:08,857][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:26:09,515][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:26:10,172][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:26:10,830][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:26:11,487][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:26:12,145][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:26:12,803][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:26:13,461][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:26:14,119][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:26:14,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:26:15,760][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-26 00:26:17,193][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:26:17,195][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:26:17,196][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:26:19,301][__main__][INFO] - Iteration 676 took 54s (9.25% Gen, 86.79% Train). Generation: 4s, Training: 46s. Estimated remaining time: 4h 40m 5s. Estimated total time: 15h 0m 19s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 1s, 500 more iterations: 7h 30m 9s. [2026-03-26 00:26:19,303][__main__][INFO] - Starting iteration 676. [2026-03-26 00:26:19,308][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:26:19,309][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:26:24,306][__main__][INFO] - Number of regex retries in iteration 676: 0 [2026-03-26 00:26:24,307][__main__][INFO] - agents played in iteration 676 are Bob, Alice [2026-03-26 00:26:25,353][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:26:25,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:26:25,430][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:26:25,431][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:26:26,306][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:26:26,914][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:26:27,574][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:26:28,233][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:26:28,892][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:26:29,554][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:26:30,221][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:26:30,882][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:26:31,541][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:26:32,200][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:26:32,858][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:26:33,517][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:26:34,175][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:26:34,832][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:26:35,490][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:26:36,148][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:26:36,806][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:26:37,464][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:26:38,122][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:26:38,779][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:26:39,437][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:26:40,095][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:26:40,753][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:26:41,411][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:26:42,069][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:26:42,727][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:26:43,384][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:26:44,042][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:26:44,700][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:26:45,358][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:26:46,015][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:26:46,673][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:26:47,330][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:26:47,990][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:26:48,646][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:26:49,304][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:26:49,962][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:26:50,621][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:26:51,279][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:26:51,936][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:26:52,594][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:26:53,251][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:26:53,909][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:26:54,566][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:26:55,224][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:26:55,882][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:26:56,540][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:26:57,198][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:26:58,237][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:26:58,896][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:26:59,554][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:27:00,212][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:27:00,870][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:27:01,528][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:27:02,186][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:27:02,846][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:27:03,504][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:27:04,163][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:27:04,823][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:27:05,484][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:27:06,144][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:27:06,805][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:27:07,465][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:27:08,125][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:27:08,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:27:09,696][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:27:11,074][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:27:11,078][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:27:11,079][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:27:13,422][__main__][INFO] - Iteration 677 took 54s (9.24% Gen, 86.43% Train). Generation: 4s, Training: 46s. Estimated remaining time: 4h 40m 47s. Estimated total time: 15h 1m 56s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 11s, 500 more iterations: 7h 30m 58s. [2026-03-26 00:27:13,426][__main__][INFO] - Starting iteration 677. [2026-03-26 00:27:13,432][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:27:13,433][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:27:21,577][__main__][INFO] - Number of regex retries in iteration 677: 0 [2026-03-26 00:27:21,579][__main__][INFO] - agents played in iteration 677 are Bob, Alice [2026-03-26 00:27:22,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:27:22,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:27:22,137][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:27:22,138][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:27:22,804][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:27:23,475][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:27:24,135][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:27:24,850][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:27:25,487][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:27:26,146][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:27:26,804][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:27:27,461][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:27:28,119][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:27:28,776][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:27:29,434][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:27:30,092][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:27:30,749][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:27:31,406][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:27:32,063][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:27:32,720][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:27:33,379][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:27:34,037][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:27:34,694][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:27:35,352][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:27:36,010][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:27:36,667][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:27:37,325][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:27:37,983][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:27:38,640][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:27:39,297][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:27:39,955][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:27:40,612][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:27:41,270][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:27:41,927][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:27:42,584][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:27:43,241][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:27:43,899][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:27:44,563][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:27:45,221][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:27:45,878][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:27:46,536][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:27:47,193][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:27:47,850][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:27:48,508][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:27:49,166][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:27:49,824][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:27:50,482][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:27:51,139][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:27:51,798][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:27:52,456][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:27:53,115][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:27:53,772][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:27:54,801][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:27:55,459][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:27:56,117][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:27:56,775][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:27:57,432][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:27:58,090][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:27:58,749][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:27:59,408][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:28:00,066][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:28:00,724][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:28:01,382][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:28:02,039][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:28:02,698][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:28:03,355][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:28:04,016][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:28:04,675][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:28:05,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:28:06,086][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:28:07,443][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:28:07,446][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:28:07,447][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:28:09,915][__main__][INFO] - Iteration 678 took 56s (14.42% Gen, 81.20% Train). Generation: 8s, Training: 45s. Estimated remaining time: 5h 19m 20s. Estimated total time: 15h 41m 25s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 8s, 500 more iterations: 7h 50m 42s. [2026-03-26 00:28:09,918][__main__][INFO] - Starting iteration 678. [2026-03-26 00:28:09,926][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:28:09,927][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:28:14,904][__main__][INFO] - Number of regex retries in iteration 678: 0 [2026-03-26 00:28:14,906][__main__][INFO] - agents played in iteration 678 are Bob, Alice [2026-03-26 00:28:15,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:28:15,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:28:15,557][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:28:15,558][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:28:16,229][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:28:16,848][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:28:17,508][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:28:18,166][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:28:18,825][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:28:19,485][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:28:20,144][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:28:20,802][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:28:21,460][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:28:22,119][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:28:22,777][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:28:23,435][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:28:24,093][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:28:24,751][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:28:25,409][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:28:26,067][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:28:26,725][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:28:27,384][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:28:28,044][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:28:28,703][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:28:29,363][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:28:30,025][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:28:30,685][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:28:31,343][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:28:32,001][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:28:32,660][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:28:33,318][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:28:33,977][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:28:34,636][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:28:35,295][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:28:35,954][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:28:36,613][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:28:37,272][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:28:37,931][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:28:38,590][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:28:39,250][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:28:39,909][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:28:40,569][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:28:41,229][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:28:41,890][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:28:42,548][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:28:43,207][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:28:43,866][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:28:44,524][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:28:45,183][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:28:45,841][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:28:46,500][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:28:47,158][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:28:48,150][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:28:48,807][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:28:49,465][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:28:50,122][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:28:50,779][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:28:51,437][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:28:52,095][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:28:52,753][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:28:53,410][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:28:54,068][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:28:54,726][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:28:55,383][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:28:56,041][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:28:56,699][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:28:57,357][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:28:58,014][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:28:58,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:28:59,453][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:29:00,788][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:29:00,791][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:29:00,793][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:29:02,626][__main__][INFO] - Iteration 679 took 52s (9.45% Gen, 87.07% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 15m 25s. Estimated total time: 14h 38m 23s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 50s, 500 more iterations: 7h 19m 11s. [2026-03-26 00:29:02,629][__main__][INFO] - Starting iteration 679. [2026-03-26 00:29:02,633][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:29:02,634][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:29:07,345][__main__][INFO] - Number of regex retries in iteration 679: 0 [2026-03-26 00:29:07,346][__main__][INFO] - agents played in iteration 679 are Bob, Alice [2026-03-26 00:29:07,871][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:29:07,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:29:07,934][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:29:07,935][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:29:08,630][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:29:09,244][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:29:09,903][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:29:10,560][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:29:11,218][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:29:11,874][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:29:12,531][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:29:13,189][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:29:13,846][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:29:14,503][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:29:15,161][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:29:15,819][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:29:16,476][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:29:17,133][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:29:17,790][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:29:18,448][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:29:19,105][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:29:19,762][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:29:20,420][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:29:21,076][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:29:21,734][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:29:22,392][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:29:23,049][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:29:23,707][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:29:24,364][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:29:25,021][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:29:25,679][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:29:26,336][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:29:26,995][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:29:27,652][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:29:28,309][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:29:28,967][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:29:29,624][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:29:30,281][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:29:30,939][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:29:31,596][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:29:32,253][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:29:32,911][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:29:33,569][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:29:34,226][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:29:34,883][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:29:35,541][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:29:36,198][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:29:36,856][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:29:37,513][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:29:38,171][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:29:38,829][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:29:39,486][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:29:40,473][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:29:41,130][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:29:41,789][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:29:42,446][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:29:43,103][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:29:43,761][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:29:44,418][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:29:45,075][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:29:45,732][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:29:46,390][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:29:47,048][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:29:47,706][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:29:48,363][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:29:49,020][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:29:49,678][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:29:50,335][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:29:50,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:29:51,701][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:29:53,085][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:29:53,088][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:29:53,090][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:29:54,893][__main__][INFO] - Iteration 680 took 52s (9.02% Gen, 87.53% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 7m 12s. Estimated total time: 14h 31m 1s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 6s, 500 more iterations: 7h 15m 30s. [2026-03-26 00:29:54,896][__main__][INFO] - Starting iteration 680. [2026-03-26 00:29:54,899][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:29:54,900][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:29:59,858][__main__][INFO] - Number of regex retries in iteration 680: 0 [2026-03-26 00:29:59,859][__main__][INFO] - agents played in iteration 680 are Bob, Alice [2026-03-26 00:30:00,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:30:00,407][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:30:00,408][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:30:00,409][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:30:01,075][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:30:01,689][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:30:02,349][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:30:03,007][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:30:03,664][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:30:04,322][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:30:04,980][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:30:05,639][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:30:06,297][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:30:06,955][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:30:07,613][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:30:08,272][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:30:08,930][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:30:09,589][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:30:10,247][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:30:10,905][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:30:11,564][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:30:12,222][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:30:12,880][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:30:13,538][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:30:14,196][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:30:14,854][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:30:15,512][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:30:16,171][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:30:16,829][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:30:17,487][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:30:18,146][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:30:18,805][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:30:19,463][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:30:20,120][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:30:20,779][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:30:21,437][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:30:22,095][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:30:22,754][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:30:23,412][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:30:24,069][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:30:24,728][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:30:25,386][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:30:26,044][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:30:26,702][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:30:27,360][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:30:28,018][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:30:28,677][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:30:29,335][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:30:29,994][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:30:30,652][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:30:31,310][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:30:31,968][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:30:32,934][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:30:33,592][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:30:34,250][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:30:34,907][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:30:35,566][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:30:36,223][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:30:36,880][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:30:37,538][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:30:38,195][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:30:38,852][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:30:39,510][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:30:40,167][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:30:40,824][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:30:41,482][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:30:42,139][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:30:42,797][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:30:43,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:30:44,144][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:30:45,484][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:30:45,487][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:30:45,489][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:30:47,717][__main__][INFO] - Iteration 681 took 52s (9.39% Gen, 86.39% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 15m 37s. Estimated total time: 14h 40m 19s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 1s, 500 more iterations: 7h 20m 9s. [2026-03-26 00:30:47,720][__main__][INFO] - Starting iteration 681. [2026-03-26 00:30:47,727][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:30:47,730][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:30:53,888][__main__][INFO] - Number of regex retries in iteration 681: 0 [2026-03-26 00:30:53,889][__main__][INFO] - agents played in iteration 681 are Bob, Alice [2026-03-26 00:30:54,970][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:30:55,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:30:55,033][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:30:55,033][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:30:55,727][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:30:56,331][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:30:56,990][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:30:57,646][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:30:58,304][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:30:58,962][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:30:59,618][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:31:00,276][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:31:00,933][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:31:01,590][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:31:02,247][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:31:02,904][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:31:03,561][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:31:04,218][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:31:04,876][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:31:05,533][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:31:06,190][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:31:06,847][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:31:07,505][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:31:08,162][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:31:08,820][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:31:09,477][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:31:10,135][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:31:10,792][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:31:11,449][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:31:12,107][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:31:12,764][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:31:13,422][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:31:14,079][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:31:14,736][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:31:15,393][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:31:16,051][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:31:16,709][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:31:17,366][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:31:18,024][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:31:18,681][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:31:19,338][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:31:19,996][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:31:20,653][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:31:21,310][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:31:21,968][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:31:22,625][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:31:23,282][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:31:23,940][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:31:24,597][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:31:25,254][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:31:25,912][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:31:26,570][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:31:27,535][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:31:28,192][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:31:28,850][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:31:29,507][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:31:30,165][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:31:30,822][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:31:31,481][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:31:32,140][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:31:32,799][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:31:33,456][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:31:34,114][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:31:34,772][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:31:35,429][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:31:36,086][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:31:36,744][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:31:37,402][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:31:38,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:31:38,832][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:31:40,175][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:31:40,179][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:31:40,180][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:31:41,687][__main__][INFO] - Iteration 682 took 53s (11.41% Gen, 85.78% Train). Generation: 6s, Training: 46s. Estimated remaining time: 4h 33m 45s. Estimated total time: 14h 59m 21s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 56s, 500 more iterations: 7h 29m 40s. [2026-03-26 00:31:41,690][__main__][INFO] - Starting iteration 682. [2026-03-26 00:31:41,694][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:31:41,695][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:31:54,442][__main__][INFO] - Number of regex retries in iteration 682: 0 [2026-03-26 00:31:54,444][__main__][INFO] - agents played in iteration 682 are Bob, Alice [2026-03-26 00:31:54,936][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:31:55,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:31:55,001][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:31:55,002][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:31:55,670][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:31:56,274][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:31:56,933][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:31:57,589][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:31:58,246][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:31:58,903][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:31:59,560][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:32:00,217][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:32:00,874][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:32:01,531][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:32:02,188][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:32:02,845][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:32:03,501][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:32:04,158][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:32:04,815][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:32:05,473][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:32:06,130][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:32:06,787][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:32:07,444][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:32:08,101][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:32:08,758][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:32:09,416][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:32:10,074][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:32:10,731][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:32:11,388][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:32:12,046][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:32:12,704][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:32:13,361][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:32:14,019][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:32:14,676][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:32:15,334][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:32:15,991][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:32:16,648][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:32:17,305][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:32:17,963][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:32:18,620][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:32:19,277][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:32:19,935][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:32:20,593][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:32:21,250][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:32:21,908][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:32:22,566][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:32:23,223][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:32:23,880][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:32:24,538][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:32:25,195][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:32:25,852][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:32:26,510][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:32:27,409][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:32:28,067][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:32:28,725][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:32:29,382][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:32:30,040][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:32:30,698][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:32:31,355][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:32:32,014][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:32:32,672][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:32:33,329][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:32:33,986][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:32:34,643][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:32:35,301][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:32:35,959][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:32:36,616][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:32:37,279][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:32:37,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:32:38,666][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:32:40,003][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:32:40,006][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:32:40,007][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:32:42,637][__main__][INFO] - Iteration 683 took 1m 0s (20.92% Gen, 74.76% Train). Generation: 12s, Training: 45s. Estimated remaining time: 6h 29m 7s. Estimated total time: 16h 55m 45s. Time estimates for 10 more iterations: 10m 9s, 100 more iterations: 1h 41m 34s, 500 more iterations: 8h 27m 52s. [2026-03-26 00:32:42,640][__main__][INFO] - Starting iteration 683. [2026-03-26 00:32:42,644][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:32:42,645][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:32:47,466][__main__][INFO] - Number of regex retries in iteration 683: 0 [2026-03-26 00:32:47,467][__main__][INFO] - agents played in iteration 683 are Bob, Alice [2026-03-26 00:32:48,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:32:48,091][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:32:48,091][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:32:48,092][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:32:48,767][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:32:49,378][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:32:50,039][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:32:50,697][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:32:51,355][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:32:52,014][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:32:52,672][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:32:53,331][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:32:53,989][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:32:54,647][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:32:55,307][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:32:55,966][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:32:56,624][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:32:57,282][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:32:57,941][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:32:58,600][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:32:59,261][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:32:59,920][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:33:00,580][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:33:01,238][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:33:01,897][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:33:02,557][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:33:03,216][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:33:03,875][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:33:04,534][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:33:05,192][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:33:05,851][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:33:06,509][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:33:07,168][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:33:07,827][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:33:08,485][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:33:09,143][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:33:09,802][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:33:10,461][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:33:11,120][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:33:11,778][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:33:12,437][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:33:13,096][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:33:13,754][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:33:14,413][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:33:15,072][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:33:15,731][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:33:16,390][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:33:17,048][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:33:17,707][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:33:18,367][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:33:19,025][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:33:19,683][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:33:20,625][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:33:21,286][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:33:21,947][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:33:22,605][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:33:23,263][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:33:23,922][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:33:24,581][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:33:25,240][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:33:25,899][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:33:26,557][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:33:27,216][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:33:27,874][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:33:28,534][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:33:29,195][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:33:29,855][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:33:30,514][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:33:31,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:33:31,890][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:33:33,211][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:33:33,214][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:33:33,215][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:33:35,126][__main__][INFO] - Iteration 684 took 52s (9.19% Gen, 87.16% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 7m 15s. Estimated total time: 14h 34m 44s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 28s, 500 more iterations: 7h 17m 22s. [2026-03-26 00:33:35,129][__main__][INFO] - Starting iteration 684. [2026-03-26 00:33:35,135][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:33:35,136][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:33:40,570][__main__][INFO] - Number of regex retries in iteration 684: 0 [2026-03-26 00:33:40,571][__main__][INFO] - agents played in iteration 684 are Bob, Alice [2026-03-26 00:33:41,549][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:33:41,613][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:33:41,613][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:33:41,614][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:33:42,306][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:33:42,912][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:33:43,573][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:33:44,231][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:33:44,890][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:33:45,548][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:33:46,206][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:33:46,865][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:33:47,523][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:33:48,182][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:33:48,841][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:33:49,500][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:33:50,158][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:33:50,816][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:33:51,475][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:33:52,134][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:33:52,792][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:33:53,450][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:33:54,109][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:33:54,767][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:33:55,426][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:33:56,085][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:33:56,743][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:33:57,402][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:33:58,061][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:33:58,720][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:33:59,379][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:34:00,037][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:34:00,696][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:34:01,355][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:34:02,015][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:34:02,673][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:34:03,332][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:34:03,990][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:34:04,649][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:34:05,308][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:34:05,967][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:34:06,626][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:34:07,285][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:34:07,944][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:34:08,603][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:34:09,262][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:34:09,921][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:34:10,580][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:34:11,239][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:34:11,898][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:34:12,557][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:34:13,216][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:34:14,116][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:34:14,776][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:34:15,436][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:34:16,095][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:34:16,754][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:34:17,413][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:34:18,072][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:34:18,731][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:34:19,390][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:34:20,051][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:34:20,709][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:34:21,368][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:34:22,027][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:34:22,686][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:34:23,345][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:34:24,003][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:34:24,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:34:25,415][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:34:26,759][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:34:26,761][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:34:26,771][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:34:28,375][__main__][INFO] - Iteration 685 took 53s (10.21% Gen, 86.77% Train). Generation: 5s, Training: 46s. Estimated remaining time: 4h 18m 59s. Estimated total time: 14h 47m 22s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 44s, 500 more iterations: 7h 23m 41s. [2026-03-26 00:34:28,379][__main__][INFO] - Starting iteration 685. [2026-03-26 00:34:28,384][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:34:28,385][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:34:34,522][__main__][INFO] - Number of regex retries in iteration 685: 0 [2026-03-26 00:34:34,523][__main__][INFO] - agents played in iteration 685 are Bob, Alice [2026-03-26 00:34:35,431][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:34:35,494][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:34:35,494][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:34:35,495][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:34:36,158][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:34:36,768][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:34:37,427][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:34:38,084][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:34:38,741][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:34:39,398][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:34:40,055][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:34:40,713][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:34:41,370][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:34:42,027][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:34:42,684][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:34:43,342][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:34:43,999][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:34:44,656][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:34:45,313][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:34:45,970][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:34:46,628][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:34:47,285][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:34:47,943][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:34:48,600][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:34:49,257][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:34:49,914][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:34:50,572][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:34:51,229][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:34:51,887][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:34:52,544][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:34:53,201][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:34:53,859][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:34:54,516][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:34:55,173][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:34:55,831][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:34:56,488][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:34:57,145][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:34:57,802][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:34:58,460][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:34:59,118][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:34:59,775][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:35:00,433][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:35:01,091][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:35:01,748][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:35:02,405][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:35:03,063][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:35:03,721][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:35:04,378][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:35:05,036][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:35:05,693][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:35:06,351][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:35:07,008][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:35:07,932][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:35:08,590][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:35:09,248][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:35:09,907][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:35:10,564][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:35:11,222][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:35:11,880][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:35:12,537][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:35:13,195][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:35:13,853][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:35:14,510][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:35:15,168][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:35:15,826][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:35:16,483][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:35:17,140][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:35:17,798][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:35:18,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:35:19,158][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:35:20,491][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:35:20,494][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:35:20,496][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:35:21,857][__main__][INFO] - Iteration 686 took 53s (11.48% Gen, 85.97% Train). Generation: 6s, Training: 45s. Estimated remaining time: 4h 21m 58s. Estimated total time: 14h 51m 14s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 7s, 500 more iterations: 7h 25m 37s. [2026-03-26 00:35:21,859][__main__][INFO] - Starting iteration 686. [2026-03-26 00:35:21,863][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:35:21,864][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:35:26,589][__main__][INFO] - Number of regex retries in iteration 686: 0 [2026-03-26 00:35:26,590][__main__][INFO] - agents played in iteration 686 are Bob, Alice [2026-03-26 00:35:27,075][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:35:27,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:35:27,138][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:35:27,138][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:35:27,807][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:35:28,422][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:35:29,081][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:35:29,738][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:35:30,395][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:35:31,052][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:35:31,709][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:35:32,366][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:35:33,023][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:35:33,680][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:35:34,337][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:35:34,994][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:35:35,652][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:35:36,309][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:35:36,966][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:35:37,624][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:35:38,282][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:35:38,938][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:35:39,595][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:35:40,252][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:35:40,910][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:35:41,567][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:35:42,224][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:35:42,881][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:35:43,539][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:35:44,196][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:35:44,854][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:35:45,511][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:35:46,169][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:35:46,826][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:35:47,483][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:35:48,141][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:35:48,798][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:35:49,455][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:35:50,114][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:35:50,772][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:35:51,430][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:35:52,088][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:35:52,745][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:35:53,402][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:35:54,060][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:35:54,717][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:35:55,374][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:35:56,031][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:35:56,689][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:35:57,346][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:35:58,004][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:35:58,662][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:35:59,599][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:36:00,257][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:36:00,915][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:36:01,573][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:36:02,230][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:36:02,887][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:36:03,545][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:36:04,202][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:36:04,860][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:36:05,517][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:36:06,174][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:36:06,833][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:36:07,490][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:36:08,147][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:36:08,805][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:36:09,462][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:36:10,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:36:10,831][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:36:12,179][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:36:12,182][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:36:12,183][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:36:13,675][__main__][INFO] - Iteration 687 took 51s (9.12% Gen, 87.99% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 53m 25s. Estimated total time: 14h 23m 33s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 21s, 500 more iterations: 7h 11m 46s. [2026-03-26 00:36:13,677][__main__][INFO] - Starting iteration 687. [2026-03-26 00:36:13,682][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:36:13,683][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:36:19,233][__main__][INFO] - Number of regex retries in iteration 687: 0 [2026-03-26 00:36:19,234][__main__][INFO] - agents played in iteration 687 are Bob, Alice [2026-03-26 00:36:19,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:36:19,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:36:19,865][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:36:19,865][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:36:20,569][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:36:21,184][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:36:21,843][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:36:22,499][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:36:23,156][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:36:23,813][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:36:24,470][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:36:25,127][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:36:25,785][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:36:26,442][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:36:27,099][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:36:27,755][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:36:28,413][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:36:29,070][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:36:29,727][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:36:30,383][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:36:31,042][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:36:31,700][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:36:32,357][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:36:33,015][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:36:33,672][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:36:34,330][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:36:34,987][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:36:35,645][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:36:36,303][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:36:36,960][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:36:37,617][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:36:38,275][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:36:38,932][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:36:39,590][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:36:40,248][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:36:40,905][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:36:41,563][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:36:42,220][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:36:42,877][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:36:43,535][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:36:44,268][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:36:44,951][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:36:45,609][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:36:46,267][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:36:46,924][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:36:47,582][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:36:48,239][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:36:48,896][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:36:49,555][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:36:50,212][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:36:50,869][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:36:51,526][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:36:52,453][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:36:53,111][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:36:53,768][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:36:54,426][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:36:55,083][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:36:55,740][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:36:56,398][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:36:57,055][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:36:57,712][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:36:58,370][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:36:59,028][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:36:59,686][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:37:00,343][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:37:01,001][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:37:01,658][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:37:02,316][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:37:02,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:37:03,749][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:37:05,082][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:37:05,085][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:37:05,086][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:37:06,720][__main__][INFO] - Iteration 688 took 53s (10.47% Gen, 86.45% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 12m 59s. Estimated total time: 14h 44m 0s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 24s, 500 more iterations: 7h 22m 0s. [2026-03-26 00:37:06,723][__main__][INFO] - Starting iteration 688. [2026-03-26 00:37:06,735][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:37:06,736][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:37:11,636][__main__][INFO] - Number of regex retries in iteration 688: 0 [2026-03-26 00:37:11,638][__main__][INFO] - agents played in iteration 688 are Bob, Alice [2026-03-26 00:37:12,203][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:37:12,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:37:12,267][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:37:12,267][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:37:12,969][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:37:13,576][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:37:14,234][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:37:14,891][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:37:15,548][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:37:16,206][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:37:16,862][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:37:17,519][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:37:18,177][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:37:18,834][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:37:19,491][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:37:20,149][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:37:20,805][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:37:21,462][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:37:22,122][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:37:22,779][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:37:23,436][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:37:24,094][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:37:24,751][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:37:25,407][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:37:26,065][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:37:26,722][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:37:27,380][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:37:28,037][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:37:28,694][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:37:29,351][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:37:30,009][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:37:30,666][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:37:31,323][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:37:31,980][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:37:32,638][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:37:33,295][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:37:33,953][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:37:34,610][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:37:35,268][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:37:35,926][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:37:36,583][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:37:37,240][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:37:37,898][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:37:38,555][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:37:39,212][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:37:39,870][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:37:40,528][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:37:41,186][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:37:41,844][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:37:42,501][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:37:43,159][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:37:43,816][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:37:44,771][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:37:45,429][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:37:46,087][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:37:46,744][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:37:47,402][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:37:48,059][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:37:48,716][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:37:49,375][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:37:50,033][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:37:50,690][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:37:51,347][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:37:52,005][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:37:52,663][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:37:53,320][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:37:53,978][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:37:54,635][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:37:55,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:37:55,990][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:37:57,316][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:37:57,319][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:37:57,320][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:37:59,083][__main__][INFO] - Iteration 689 took 52s (9.36% Gen, 87.26% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 0m 37s. Estimated total time: 14h 32m 31s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 15s, 500 more iterations: 7h 16m 15s. [2026-03-26 00:37:59,086][__main__][INFO] - Starting iteration 689. [2026-03-26 00:37:59,091][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:37:59,091][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:38:09,036][__main__][INFO] - Number of regex retries in iteration 689: 0 [2026-03-26 00:38:09,037][__main__][INFO] - agents played in iteration 689 are Bob, Alice [2026-03-26 00:38:09,938][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:38:09,999][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:38:10,000][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:38:10,001][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:38:10,699][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:38:11,306][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:38:11,965][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:38:12,621][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:38:13,277][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:38:13,934][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:38:14,591][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:38:15,247][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:38:15,904][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:38:16,561][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:38:17,218][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:38:17,875][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:38:18,533][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:38:19,190][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:38:19,847][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:38:20,504][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:38:21,161][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:38:21,818][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:38:22,476][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:38:23,133][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:38:23,790][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:38:24,448][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:38:25,105][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:38:25,762][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:38:26,419][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:38:27,076][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:38:27,733][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:38:28,391][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:38:29,048][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:38:29,706][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:38:30,363][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:38:31,020][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:38:31,678][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:38:32,335][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:38:32,992][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:38:33,651][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:38:34,308][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:38:34,965][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:38:35,623][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:38:36,280][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:38:36,938][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:38:37,595][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:38:38,252][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:38:38,910][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:38:39,568][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:38:40,225][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:38:40,882][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:38:41,540][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:38:42,463][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:38:43,121][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:38:43,779][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:38:44,437][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:38:45,094][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:38:45,751][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:38:46,410][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:38:47,067][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:38:47,724][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:38:48,382][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:38:49,039][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:38:49,696][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:38:50,354][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:38:51,011][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:38:51,668][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:38:52,327][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:38:52,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:38:53,678][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 00:38:55,002][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:38:55,005][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:38:55,009][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:38:56,427][__main__][INFO] - Iteration 690 took 57s (17.35% Gen, 80.18% Train). Generation: 9s, Training: 45s. Estimated remaining time: 5h 22m 47s. Estimated total time: 15h 55m 38s. Time estimates for 10 more iterations: 9m 33s, 100 more iterations: 1h 35m 33s, 500 more iterations: 7h 57m 49s. [2026-03-26 00:38:56,431][__main__][INFO] - Starting iteration 690. [2026-03-26 00:38:56,436][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:38:56,436][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:39:01,751][__main__][INFO] - Number of regex retries in iteration 690: 0 [2026-03-26 00:39:01,752][__main__][INFO] - agents played in iteration 690 are Bob, Alice [2026-03-26 00:39:02,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:39:02,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:39:02,424][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:39:02,424][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:39:03,124][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:39:03,745][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:39:04,403][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:39:05,060][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:39:05,717][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:39:06,374][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:39:07,032][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:39:07,689][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:39:08,346][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:39:09,005][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:39:09,662][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:39:10,319][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:39:10,977][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:39:11,634][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:39:12,291][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:39:12,949][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:39:13,606][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:39:14,263][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:39:14,921][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:39:15,578][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:39:16,235][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:39:16,892][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:39:17,549][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:39:18,207][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:39:18,864][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:39:19,522][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:39:20,179][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:39:21,487][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:39:22,499][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:39:23,330][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:39:23,988][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:39:24,646][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:39:25,303][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:39:25,960][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:39:26,618][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:39:27,276][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:39:27,933][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:39:28,592][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:39:29,249][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:39:29,907][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:39:30,565][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:39:31,222][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:39:31,879][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:39:32,537][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:39:33,195][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:39:33,852][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:39:34,510][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:39:35,168][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:39:36,079][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:39:36,737][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:39:37,394][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:39:38,053][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:39:38,710][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:39:39,367][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:39:40,025][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:39:40,683][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:39:41,340][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:39:41,998][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:39:42,655][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:39:43,313][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:39:43,970][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:39:44,627][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:39:45,285][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:39:45,942][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:39:46,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:39:47,351][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-26 00:39:48,679][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:39:48,682][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:39:48,683][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:39:50,229][__main__][INFO] - Iteration 691 took 53s (9.88% Gen, 87.24% Train). Generation: 5s, Training: 46s. Estimated remaining time: 4h 22m 50s. Estimated total time: 14h 56m 35s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 39s, 500 more iterations: 7h 28m 17s. [2026-03-26 00:39:50,232][__main__][INFO] - Starting iteration 691. [2026-03-26 00:39:50,237][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:39:50,238][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:39:55,042][__main__][INFO] - Number of regex retries in iteration 691: 0 [2026-03-26 00:39:55,043][__main__][INFO] - agents played in iteration 691 are Bob, Alice [2026-03-26 00:39:55,533][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:39:55,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:39:55,594][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:39:55,595][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:39:56,294][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:39:56,907][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:39:57,566][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:39:58,225][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:39:58,883][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:39:59,541][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:40:00,199][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:40:00,857][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:40:01,514][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:40:02,173][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:40:02,831][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:40:03,489][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:40:04,147][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:40:04,805][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:40:05,463][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:40:06,121][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:40:06,780][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:40:07,438][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:40:08,096][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:40:08,754][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:40:09,413][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:40:10,072][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:40:10,730][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:40:11,389][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:40:12,047][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:40:12,706][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:40:13,364][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:40:14,023][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:40:14,682][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:40:15,340][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:40:15,998][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:40:16,656][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:40:17,315][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:40:17,973][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:40:18,632][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:40:19,290][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:40:19,948][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:40:20,606][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:40:21,265][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:40:21,923][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:40:22,581][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:40:23,239][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:40:23,898][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:40:24,556][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:40:25,215][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:40:25,873][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:40:26,532][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:40:27,190][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:40:28,093][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:40:28,753][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:40:29,410][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:40:30,067][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:40:30,725][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:40:31,383][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:40:32,040][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:40:32,698][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:40:33,355][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:40:34,012][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:40:34,669][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:40:35,327][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:40:35,984][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:40:36,642][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:40:37,300][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:40:37,958][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:40:38,615][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:40:39,317][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:40:40,638][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:40:40,641][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:40:40,642][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:40:42,231][__main__][INFO] - Iteration 692 took 51s (9.24% Gen, 87.70% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 52m 0s. Estimated total time: 14h 26m 37s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 39s, 500 more iterations: 7h 13m 18s. [2026-03-26 00:40:42,234][__main__][INFO] - Starting iteration 692. [2026-03-26 00:40:42,238][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:40:42,238][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:40:47,092][__main__][INFO] - Number of regex retries in iteration 692: 0 [2026-03-26 00:40:47,093][__main__][INFO] - agents played in iteration 692 are Bob, Alice [2026-03-26 00:40:47,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:40:47,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:40:47,734][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:40:47,734][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:40:48,430][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:40:49,042][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:40:49,702][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:40:50,358][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:40:51,015][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:40:51,673][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:40:52,331][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:40:52,989][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:40:53,647][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:40:54,304][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:40:54,962][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:40:55,619][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:40:56,276][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:40:56,934][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:40:57,591][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:40:58,250][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:40:58,907][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:40:59,565][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:41:00,222][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:41:00,879][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:41:01,537][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:41:02,194][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:41:02,852][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:41:03,510][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:41:04,167][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:41:04,824][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:41:05,482][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:41:06,139][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:41:06,796][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:41:07,453][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:41:08,111][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:41:08,769][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:41:09,426][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:41:10,086][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:41:10,743][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:41:11,401][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:41:12,058][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:41:12,716][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:41:13,373][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:41:14,031][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:41:14,688][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:41:15,345][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:41:16,003][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:41:16,660][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:41:17,317][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:41:17,975][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:41:18,632][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:41:19,289][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:41:20,243][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:41:20,901][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:41:21,558][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:41:22,216][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:41:22,874][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:41:23,531][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:41:24,189][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:41:24,847][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:41:25,504][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:41:26,162][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:41:26,819][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:41:27,477][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:41:28,134][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:41:28,792][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:41:29,450][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:41:30,107][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:41:30,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:41:31,482][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:41:32,809][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:41:32,812][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:41:32,813][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:41:34,157][__main__][INFO] - Iteration 693 took 51s (9.35% Gen, 88.06% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 49m 52s. Estimated total time: 14h 25m 21s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 32s, 500 more iterations: 7h 12m 40s. [2026-03-26 00:41:34,159][__main__][INFO] - Starting iteration 693. [2026-03-26 00:41:34,163][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:41:34,164][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:41:39,737][__main__][INFO] - Number of regex retries in iteration 693: 0 [2026-03-26 00:41:39,738][__main__][INFO] - agents played in iteration 693 are Bob, Alice [2026-03-26 00:41:40,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:41:40,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:41:40,766][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:41:40,767][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:41:41,468][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:41:42,086][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:41:42,745][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:41:43,402][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:41:44,060][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:41:44,716][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:41:45,374][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:41:46,032][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:41:46,689][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:41:47,346][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:41:48,004][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:41:48,661][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:41:49,318][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:41:49,976][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:41:50,633][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:41:51,290][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:41:51,947][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:41:52,604][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:41:53,262][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:41:53,919][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:41:54,576][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:41:55,233][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:41:55,890][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:41:56,547][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:41:57,205][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:41:57,863][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:41:58,521][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:41:59,178][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:41:59,835][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:42:00,492][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:42:01,150][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:42:01,806][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:42:02,464][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:42:03,121][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:42:03,778][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:42:04,435][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:42:05,093][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:42:05,750][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:42:06,407][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:42:07,065][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:42:07,722][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:42:08,379][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:42:09,037][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:42:09,694][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:42:10,351][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:42:11,009][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:42:11,666][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:42:12,323][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:42:13,246][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:42:13,904][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:42:14,561][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:42:15,220][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:42:15,877][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:42:16,534][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:42:17,191][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:42:17,849][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:42:18,506][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:42:19,163][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:42:19,820][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:42:20,478][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:42:21,135][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:42:21,793][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:42:22,454][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:42:23,108][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:42:23,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:42:24,458][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 00:42:25,785][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:42:25,788][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:42:25,789][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:42:27,187][__main__][INFO] - Iteration 694 took 53s (10.51% Gen, 86.85% Train). Generation: 5s, Training: 46s. Estimated remaining time: 4h 7m 23s. Estimated total time: 14h 43m 45s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 22s, 500 more iterations: 7h 21m 52s. [2026-03-26 00:42:27,189][__main__][INFO] - Starting iteration 694. [2026-03-26 00:42:27,194][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:42:27,194][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:42:34,866][__main__][INFO] - Number of regex retries in iteration 694: 0 [2026-03-26 00:42:34,867][__main__][INFO] - agents played in iteration 694 are Bob, Alice [2026-03-26 00:42:35,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:42:35,530][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:42:35,531][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:42:35,532][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:42:36,189][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:42:36,807][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:42:37,465][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:42:38,122][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:42:38,778][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:42:39,435][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:42:40,092][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:42:40,749][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:42:41,406][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:42:42,063][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:42:42,719][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:42:43,377][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:42:44,034][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:42:44,691][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:42:45,348][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:42:46,005][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:42:46,662][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:42:47,320][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:42:47,977][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:42:48,634][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:42:49,291][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:42:49,948][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:42:50,605][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:42:51,262][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:42:51,920][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:42:52,577][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:42:53,235][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:42:53,892][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:42:54,549][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:42:55,206][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:42:55,863][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:42:56,520][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:42:57,178][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:42:57,835][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:42:58,494][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:43:00,101][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:43:00,758][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:43:01,415][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:43:02,072][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:43:02,730][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:43:03,387][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:43:04,044][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:43:04,701][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:43:05,359][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:43:06,016][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:43:06,674][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:43:07,331][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:43:07,988][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:43:08,901][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:43:09,559][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:43:10,217][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:43:10,875][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:43:11,533][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:43:12,191][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:43:12,849][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:43:13,506][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:43:14,163][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:43:14,821][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:43:15,478][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:43:16,135][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:43:16,793][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:43:17,450][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:43:18,108][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:43:18,766][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:43:19,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:43:20,172][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:43:21,548][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:43:21,552][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:43:21,553][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:43:22,910][__main__][INFO] - Iteration 695 took 55s (13.77% Gen, 83.79% Train). Generation: 7s, Training: 46s. Estimated remaining time: 4h 51m 21s. Estimated total time: 15h 28m 38s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 51s, 500 more iterations: 7h 44m 19s. [2026-03-26 00:43:22,913][__main__][INFO] - Starting iteration 695. [2026-03-26 00:43:22,917][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:43:22,918][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:43:27,823][__main__][INFO] - Number of regex retries in iteration 695: 0 [2026-03-26 00:43:27,824][__main__][INFO] - agents played in iteration 695 are Bob, Alice [2026-03-26 00:43:28,383][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:43:28,447][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:43:28,447][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:43:28,448][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:43:29,117][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:43:29,731][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:43:30,391][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:43:31,048][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:43:31,706][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:43:32,364][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:43:33,022][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:43:33,680][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:43:34,338][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:43:34,997][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:43:35,655][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:43:36,313][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:43:36,971][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:43:37,629][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:43:38,287][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:43:38,945][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:43:39,603][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:43:40,261][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:43:40,919][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:43:41,578][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:43:42,236][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:43:42,894][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:43:43,553][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:43:44,212][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:43:44,870][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:43:45,528][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:43:46,187][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:43:46,845][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:43:47,503][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:43:48,162][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:43:48,821][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:43:49,480][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:43:50,138][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:43:50,796][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:43:51,455][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:43:52,113][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:43:52,771][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:43:53,430][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:43:54,089][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:43:54,748][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:43:55,406][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:43:56,064][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:43:56,722][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:43:57,380][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:43:58,038][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:43:58,698][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:43:59,356][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:44:00,014][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:44:00,935][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:44:01,593][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:44:02,250][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:44:02,908][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:44:03,565][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:44:04,223][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:44:04,880][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:44:05,543][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:44:06,201][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:44:06,858][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:44:07,516][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:44:08,174][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:44:08,831][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:44:09,490][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:44:10,149][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:44:10,808][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:44:11,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:44:12,242][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:44:13,971][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:44:13,974][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:44:13,975][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:44:15,384][__main__][INFO] - Iteration 696 took 52s (9.35% Gen, 87.96% Train). Generation: 4s, Training: 46s. Estimated remaining time: 3h 56m 19s. Estimated total time: 14h 34m 29s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 26s, 500 more iterations: 7h 17m 14s. [2026-03-26 00:44:15,386][__main__][INFO] - Starting iteration 696. [2026-03-26 00:44:15,390][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:44:15,391][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:44:22,103][__main__][INFO] - Number of regex retries in iteration 696: 0 [2026-03-26 00:44:22,104][__main__][INFO] - agents played in iteration 696 are Bob, Alice [2026-03-26 00:44:22,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:44:22,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:44:22,671][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:44:22,672][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:44:23,423][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:44:24,041][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:44:24,699][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:44:25,356][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:44:26,013][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:44:26,669][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:44:27,326][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:44:27,984][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:44:28,641][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:44:29,298][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:44:29,955][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:44:30,612][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:44:31,270][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:44:31,927][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:44:32,585][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:44:33,242][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:44:33,899][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:44:34,557][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:44:35,215][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:44:35,872][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:44:36,529][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:44:37,187][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:44:37,844][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:44:38,501][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:44:39,159][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:44:39,817][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:44:40,474][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:44:41,132][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:44:41,788][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:44:42,446][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:44:43,103][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:44:43,760][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:44:44,417][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:44:45,074][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:44:45,731][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:44:46,388][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:44:47,047][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:44:47,704][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:44:48,362][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:44:49,020][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:44:49,678][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:44:50,335][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:44:50,993][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:44:51,650][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:44:52,308][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:44:52,965][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:44:53,622][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:44:54,279][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:44:55,273][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:44:55,930][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:44:56,588][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:44:57,245][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:44:57,902][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:44:58,560][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:44:59,217][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:44:59,874][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:45:00,531][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:45:01,189][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:45:01,847][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:45:02,504][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:45:03,161][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:45:03,819][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:45:04,476][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:45:05,135][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:45:05,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:45:06,539][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:45:07,915][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:45:07,918][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:45:07,919][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:45:09,303][__main__][INFO] - Iteration 697 took 53s (12.45% Gen, 84.98% Train). Generation: 6s, Training: 45s. Estimated remaining time: 4h 19m 31s. Estimated total time: 14h 58m 34s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 51s, 500 more iterations: 7h 29m 17s. [2026-03-26 00:45:09,306][__main__][INFO] - Starting iteration 697. [2026-03-26 00:45:09,310][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:45:09,311][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:45:14,823][__main__][INFO] - Number of regex retries in iteration 697: 0 [2026-03-26 00:45:14,824][__main__][INFO] - agents played in iteration 697 are Bob, Alice [2026-03-26 00:45:15,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:45:16,019][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:45:16,020][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:45:16,021][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:45:16,746][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:45:17,364][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:45:18,022][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:45:18,679][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:45:19,336][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:45:19,993][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:45:20,650][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:45:21,307][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:45:21,964][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:45:22,621][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:45:23,278][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:45:23,934][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:45:24,591][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:45:25,249][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:45:25,908][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:45:26,566][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:45:27,224][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:45:27,883][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:45:28,542][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:45:29,201][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:45:29,860][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:45:30,518][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:45:31,178][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:45:31,837][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:45:32,495][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:45:33,153][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:45:33,811][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:45:34,470][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:45:35,128][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:45:35,787][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:45:36,446][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:45:37,105][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:45:37,763][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:45:38,424][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:45:39,080][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:45:39,739][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:45:40,398][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:45:41,056][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:45:41,715][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:45:42,374][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:45:43,033][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:45:43,691][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:45:44,350][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:45:45,009][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:45:45,667][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:45:46,326][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:45:46,984][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:45:47,643][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:45:48,551][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:45:49,211][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:45:49,870][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:45:50,529][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:45:51,189][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:45:51,848][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:45:52,506][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:45:53,165][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:45:53,824][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:45:54,483][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:45:55,141][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:45:55,799][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:45:56,459][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:45:57,117][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:45:57,776][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:45:58,435][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:45:59,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:45:59,829][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:46:01,152][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:46:01,155][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:46:01,157][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:46:02,540][__main__][INFO] - Iteration 698 took 53s (10.36% Gen, 87.04% Train). Generation: 5s, Training: 46s. Estimated remaining time: 4h 7m 14s. Estimated total time: 14h 47m 11s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 43s, 500 more iterations: 7h 23m 35s. [2026-03-26 00:46:02,542][__main__][INFO] - Starting iteration 698. [2026-03-26 00:46:02,547][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:46:02,548][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:46:08,205][__main__][INFO] - Number of regex retries in iteration 698: 0 [2026-03-26 00:46:08,207][__main__][INFO] - agents played in iteration 698 are Bob, Alice [2026-03-26 00:46:08,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:46:08,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:46:08,768][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:46:08,768][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:46:09,436][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:46:10,045][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:46:10,705][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:46:11,363][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:46:12,021][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:46:12,678][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:46:13,337][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:46:13,995][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:46:14,653][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:46:15,312][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:46:15,971][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:46:16,629][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:46:17,287][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:46:17,946][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:46:18,605][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:46:19,264][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:46:19,922][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:46:20,582][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:46:21,241][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:46:21,900][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:46:22,559][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:46:23,277][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:46:23,936][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:46:24,595][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:46:25,254][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:46:25,912][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:46:26,570][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:46:27,229][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:46:27,888][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:46:28,547][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:46:29,206][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:46:29,865][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:46:30,523][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:46:31,181][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:46:31,839][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:46:32,498][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:46:33,157][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:46:33,815][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:46:34,474][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:46:35,133][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:46:35,792][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:46:36,450][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:46:37,110][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:46:37,769][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:46:38,427][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:46:39,086][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:46:39,745][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:46:40,404][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:46:41,325][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:46:41,986][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:46:42,646][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:46:43,304][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:46:43,963][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:46:44,622][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:46:45,282][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:46:45,941][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:46:46,600][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:46:47,259][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:46:47,918][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:46:48,577][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:46:49,236][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:46:49,895][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:46:50,554][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:46:51,212][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:46:51,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:46:52,565][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:46:53,892][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:46:53,895][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:46:53,896][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:46:55,430][__main__][INFO] - Iteration 699 took 52s (10.70% Gen, 86.39% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 0m 34s. Estimated total time: 14h 41m 24s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 8s, 500 more iterations: 7h 20m 42s. [2026-03-26 00:46:55,432][__main__][INFO] - Starting iteration 699. [2026-03-26 00:46:55,439][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:46:55,440][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:47:00,439][__main__][INFO] - Number of regex retries in iteration 699: 0 [2026-03-26 00:47:00,441][__main__][INFO] - agents played in iteration 699 are Bob, Alice [2026-03-26 00:47:01,055][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:47:01,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:47:01,117][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:47:01,118][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:47:01,797][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:47:02,416][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:47:03,076][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:47:03,734][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:47:04,393][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:47:05,051][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:47:05,709][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:47:06,368][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:47:07,026][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:47:07,685][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:47:08,344][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:47:09,003][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:47:09,660][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:47:10,319][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:47:10,977][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:47:11,635][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:47:12,294][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:47:12,952][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:47:13,611][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:47:14,269][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:47:14,928][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:47:15,587][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:47:16,245][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:47:16,904][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:47:17,563][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:47:18,221][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:47:18,880][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:47:19,539][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:47:20,198][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:47:20,856][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:47:21,515][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:47:22,174][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:47:22,832][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:47:23,491][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:47:24,151][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:47:24,809][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:47:25,468][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:47:26,126][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:47:26,785][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:47:27,444][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:47:28,103][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:47:28,762][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:47:29,420][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:47:30,079][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:47:30,739][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:47:31,399][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:47:32,057][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:47:32,716][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:47:33,691][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:47:34,352][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:47:35,011][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:47:35,670][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:47:36,328][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:47:36,987][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:47:37,646][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:47:38,304][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:47:38,963][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:47:40,083][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:47:40,742][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:47:41,400][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:47:42,059][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:47:42,717][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:47:43,376][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:47:44,036][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:47:44,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:47:45,402][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:47:46,725][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:47:46,729][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:47:46,731][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:47:48,039][__main__][INFO] - Iteration 700 took 52s (9.51% Gen, 88.00% Train). Generation: 5s, Training: 46s. Estimated remaining time: 3h 54m 59s. Estimated total time: 14h 36m 42s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 40s, 500 more iterations: 7h 18m 21s. [2026-03-26 00:47:48,042][__main__][INFO] - Starting iteration 700. [2026-03-26 00:47:48,046][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:47:48,047][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:47:52,787][__main__][INFO] - Number of regex retries in iteration 700: 0 [2026-03-26 00:47:52,789][__main__][INFO] - agents played in iteration 700 are Bob, Alice [2026-03-26 00:47:53,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:47:53,351][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:47:53,352][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:47:53,353][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:47:54,011][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:47:54,621][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:47:55,281][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:47:55,940][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:47:56,599][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:47:57,257][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:47:57,915][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:47:58,574][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:47:59,233][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:47:59,893][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:48:00,552][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:48:01,210][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:48:01,869][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:48:02,527][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:48:03,186][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:48:03,845][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:48:04,503][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:48:05,162][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:48:05,820][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:48:06,479][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:48:07,137][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:48:07,797][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:48:08,455][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:48:09,114][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:48:09,772][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:48:10,431][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:48:11,089][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:48:11,748][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:48:12,406][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:48:13,065][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:48:13,724][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:48:14,383][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:48:15,042][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:48:15,700][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:48:16,359][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:48:17,017][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:48:17,676][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:48:18,334][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:48:18,993][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:48:19,651][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:48:20,310][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:48:20,968][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:48:21,627][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:48:22,286][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:48:22,944][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:48:23,603][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:48:24,262][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:48:24,921][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:48:25,817][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:48:26,477][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:48:27,136][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:48:27,796][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:48:28,456][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:48:29,116][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:48:29,776][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:48:30,434][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:48:31,094][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:48:31,752][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:48:32,411][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:48:33,070][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:48:33,729][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:48:34,387][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:48:35,047][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:48:35,705][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:48:36,363][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:48:37,078][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:48:38,460][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:48:38,463][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:48:38,464][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:48:41,315][__main__][INFO] - Iteration 701 took 53s (8.90% Gen, 85.74% Train). Generation: 4s, Training: 45s. Estimated remaining time: 4h 5m 14s. Estimated total time: 14h 47m 50s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 47s, 500 more iterations: 7h 23m 55s. [2026-03-26 00:48:41,405][__main__][INFO] - Starting iteration 701. [2026-03-26 00:48:41,409][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:48:41,409][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:48:46,456][__main__][INFO] - Number of regex retries in iteration 701: 0 [2026-03-26 00:48:46,457][__main__][INFO] - agents played in iteration 701 are Bob, Alice [2026-03-26 00:48:46,950][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:48:47,013][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:48:47,013][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:48:47,014][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:48:47,713][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:48:48,328][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:48:48,988][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:48:49,649][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:48:50,308][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:48:50,966][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:48:51,625][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:48:52,283][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:48:52,942][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:48:53,602][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:48:54,260][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:48:54,918][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:48:55,577][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:48:56,236][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:48:56,894][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:48:57,553][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:48:58,211][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:48:58,870][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:48:59,529][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:49:00,187][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:49:00,845][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:49:01,503][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:49:02,162][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:49:02,821][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:49:03,480][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:49:04,138][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:49:04,796][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:49:05,454][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:49:06,114][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:49:06,773][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:49:07,431][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:49:08,089][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:49:08,748][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:49:09,407][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:49:10,066][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:49:10,725][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:49:11,383][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:49:12,042][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:49:12,700][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:49:13,359][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:49:14,018][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:49:14,677][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:49:15,335][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:49:15,994][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:49:16,653][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:49:17,311][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:49:17,970][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:49:19,943][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:49:20,894][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:49:21,553][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:49:22,213][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:49:22,872][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:49:23,531][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:49:24,189][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:49:24,848][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:49:25,506][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:49:26,165][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:49:26,824][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:49:27,483][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:49:28,141][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:49:28,801][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:49:29,459][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:49:30,118][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:49:30,777][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:49:31,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:49:32,180][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-26 00:49:33,553][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:49:33,556][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:49:33,557][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:49:34,906][__main__][INFO] - Iteration 702 took 53s (9.43% Gen, 88.04% Train). Generation: 5s, Training: 47s. Estimated remaining time: 4h 8m 9s. Estimated total time: 14h 51m 38s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 9s, 500 more iterations: 7h 25m 49s. [2026-03-26 00:49:34,909][__main__][INFO] - Starting iteration 702. [2026-03-26 00:49:34,917][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:49:34,918][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:49:40,977][__main__][INFO] - Number of regex retries in iteration 702: 0 [2026-03-26 00:49:40,979][__main__][INFO] - agents played in iteration 702 are Bob, Alice [2026-03-26 00:49:42,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:49:42,121][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:49:42,122][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:49:42,122][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:49:42,813][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:49:43,439][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:49:44,101][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:49:44,759][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:49:45,420][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:49:46,078][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:49:46,737][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:49:47,395][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:49:48,053][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:49:48,711][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:49:49,371][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:49:50,028][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:49:50,686][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:49:51,345][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:49:52,003][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:49:52,662][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:49:53,320][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:49:53,979][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:49:54,637][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:49:55,295][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:49:55,954][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:49:56,612][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:49:57,271][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:49:57,931][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:49:58,590][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:49:59,249][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:49:59,908][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:50:00,567][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:50:01,226][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:50:01,884][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:50:02,543][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:50:03,202][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:50:03,860][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:50:04,519][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:50:05,177][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:50:05,836][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:50:06,495][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:50:07,153][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:50:07,813][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:50:08,471][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:50:09,130][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:50:09,789][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:50:10,448][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:50:11,107][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:50:11,765][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:50:12,424][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:50:13,083][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:50:13,741][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:50:14,645][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:50:15,306][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:50:15,965][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:50:16,624][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:50:17,283][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:50:17,943][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:50:18,602][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:50:19,260][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:50:19,919][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:50:20,578][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:50:21,237][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:50:21,896][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:50:22,555][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:50:23,214][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:50:23,873][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:50:24,532][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:50:25,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:50:25,900][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:50:27,226][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:50:27,229][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:50:27,231][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:50:28,878][__main__][INFO] - Iteration 703 took 53s (11.23% Gen, 85.71% Train). Generation: 6s, Training: 46s. Estimated remaining time: 4h 15m 0s. Estimated total time: 14h 59m 24s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 56s, 500 more iterations: 7h 29m 42s. [2026-03-26 00:50:28,881][__main__][INFO] - Starting iteration 703. [2026-03-26 00:50:28,885][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:50:28,885][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:50:33,653][__main__][INFO] - Number of regex retries in iteration 703: 0 [2026-03-26 00:50:33,654][__main__][INFO] - agents played in iteration 703 are Bob, Alice [2026-03-26 00:50:34,160][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:50:34,222][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:50:34,223][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:50:34,224][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:50:34,920][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:50:35,540][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:50:36,201][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:50:36,861][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:50:37,519][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:50:38,178][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:50:38,836][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:50:39,494][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:50:40,154][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:50:40,812][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:50:41,470][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:50:42,129][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:50:42,787][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:50:43,446][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:50:44,105][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:50:44,763][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:50:45,422][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:50:46,080][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:50:46,739][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:50:47,397][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:50:48,056][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:50:48,715][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:50:49,373][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:50:50,032][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:50:50,692][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:50:51,350][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:50:52,009][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:50:52,668][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:50:53,327][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:50:53,985][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:50:54,644][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:50:55,303][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:50:55,962][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:50:56,620][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:50:57,280][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:50:57,939][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:50:58,598][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:50:59,257][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:50:59,916][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:51:00,576][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:51:01,237][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:51:01,896][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:51:02,555][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:51:03,214][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:51:03,873][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:51:04,532][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:51:05,191][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:51:05,850][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:51:06,810][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:51:07,470][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:51:08,131][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:51:08,789][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:51:09,449][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:51:10,108][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:51:10,767][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:51:11,425][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:51:12,085][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:51:12,743][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:51:13,402][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:51:14,060][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:51:14,720][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:51:16,177][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:51:16,836][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:51:17,495][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:51:18,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:51:18,861][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:51:20,185][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:51:20,188][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:51:20,189][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:51:21,431][__main__][INFO] - Iteration 704 took 52s (9.07% Gen, 88.56% Train). Generation: 4s, Training: 46s. Estimated remaining time: 3h 50m 32s. Estimated total time: 14h 35m 48s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 34s, 500 more iterations: 7h 17m 54s. [2026-03-26 00:51:21,434][__main__][INFO] - Starting iteration 704. [2026-03-26 00:51:21,438][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:51:21,438][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:51:26,338][__main__][INFO] - Number of regex retries in iteration 704: 0 [2026-03-26 00:51:26,339][__main__][INFO] - agents played in iteration 704 are Bob, Alice [2026-03-26 00:51:26,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:51:26,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:51:26,891][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:51:26,892][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:51:27,562][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:51:28,168][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:51:28,828][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:51:29,486][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:51:30,145][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:51:30,805][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:51:31,465][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:51:32,125][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:51:32,784][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:51:33,443][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:51:34,101][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:51:34,760][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:51:35,419][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:51:36,077][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:51:36,736][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:51:37,394][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:51:38,053][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:51:38,712][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:51:39,371][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:51:40,029][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:51:40,688][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:51:41,347][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:51:42,005][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:51:42,665][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:51:43,323][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:51:43,983][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:51:44,642][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:51:45,300][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:51:45,959][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:51:46,618][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:51:47,276][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:51:47,935][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:51:48,594][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:51:49,252][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:51:49,911][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:51:50,570][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:51:51,229][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:51:51,888][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:51:52,547][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:51:53,207][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:51:53,865][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:51:54,523][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:51:55,182][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:51:55,841][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:51:56,499][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:51:57,160][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:51:57,820][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:51:58,480][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:51:59,442][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:52:00,102][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:52:00,761][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:52:01,420][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:52:02,078][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:52:02,738][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:52:03,397][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:52:04,055][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:52:04,714][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:52:05,373][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:52:06,032][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:52:06,691][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:52:07,350][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:52:08,008][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:52:08,667][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:52:09,327][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:52:09,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:52:10,689][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:52:12,104][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:52:12,108][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:52:12,118][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:52:14,456][__main__][INFO] - Iteration 705 took 53s (9.24% Gen, 86.34% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 57m 31s. Estimated total time: 14h 43m 40s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 22s, 500 more iterations: 7h 21m 50s. [2026-03-26 00:52:14,461][__main__][INFO] - Starting iteration 705. [2026-03-26 00:52:14,466][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:52:14,466][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:52:20,634][__main__][INFO] - Number of regex retries in iteration 705: 0 [2026-03-26 00:52:20,635][__main__][INFO] - agents played in iteration 705 are Bob, Alice [2026-03-26 00:52:21,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:52:21,306][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:52:21,307][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:52:21,307][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:52:22,004][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:52:22,618][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:52:23,278][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:52:23,936][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:52:24,594][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:52:25,252][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:52:25,910][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:52:26,569][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:52:27,229][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:52:27,887][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:52:28,547][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:52:29,207][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:52:29,866][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:52:30,525][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:52:31,183][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:52:31,842][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:52:32,501][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:52:33,158][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:52:33,817][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:52:34,476][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:52:35,135][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:52:35,794][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:52:36,453][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:52:37,111][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:52:37,770][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:52:38,429][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:52:39,087][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:52:39,746][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:52:40,405][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:52:41,064][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:52:41,722][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:52:42,381][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:52:43,040][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:52:43,698][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:52:44,356][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:52:45,016][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:52:45,675][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:52:46,334][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:52:46,992][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:52:47,651][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:52:48,310][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:52:48,968][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:52:49,627][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:52:50,285][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:52:50,943][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:52:51,602][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:52:52,261][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:52:52,920][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:52:53,820][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:52:54,480][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:52:55,139][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:52:55,798][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:52:56,457][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:52:57,115][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:52:57,774][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:52:58,435][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:52:59,095][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:52:59,754][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:53:00,412][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:53:01,072][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:53:01,731][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:53:02,390][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:53:03,049][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:53:03,708][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:53:04,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:53:05,074][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:53:06,438][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:53:06,443][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:53:06,444][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:53:07,920][__main__][INFO] - Iteration 706 took 53s (11.54% Gen, 85.69% Train). Generation: 6s, Training: 45s. Estimated remaining time: 4h 3m 54s. Estimated total time: 14h 50m 56s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 5s, 500 more iterations: 7h 25m 28s. [2026-03-26 00:53:07,922][__main__][INFO] - Starting iteration 706. [2026-03-26 00:53:07,927][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:53:07,928][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:53:13,806][__main__][INFO] - Number of regex retries in iteration 706: 0 [2026-03-26 00:53:13,807][__main__][INFO] - agents played in iteration 706 are Bob, Alice [2026-03-26 00:53:14,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:53:14,876][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:53:14,876][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:53:14,877][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:53:15,550][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:53:16,188][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:53:16,849][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:53:17,508][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:53:18,168][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:53:18,827][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:53:19,487][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:53:20,147][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:53:20,807][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:53:21,466][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:53:22,126][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:53:22,785][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:53:23,444][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:53:24,103][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:53:24,763][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:53:25,422][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:53:26,082][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:53:26,741][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:53:27,400][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:53:28,060][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:53:28,720][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:53:29,380][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:53:30,040][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:53:30,700][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:53:31,359][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:53:32,019][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:53:32,679][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:53:33,338][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:53:33,998][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:53:34,658][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:53:35,318][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:53:35,979][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:53:36,639][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:53:37,298][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:53:37,958][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:53:38,617][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:53:39,277][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:53:39,937][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:53:40,596][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:53:41,256][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:53:41,916][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:53:42,575][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:53:43,235][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:53:43,895][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:53:44,555][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:53:45,215][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:53:45,874][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:53:46,534][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:53:47,500][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:53:48,160][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:53:48,819][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:53:49,478][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:53:50,137][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:53:50,795][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:53:51,454][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:53:52,113][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:53:52,773][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:53:53,431][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:53:54,091][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:53:54,750][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:53:55,408][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:53:56,067][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:53:56,727][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:53:57,386][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:53:58,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:53:58,741][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:54:00,082][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:54:00,084][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:54:00,086][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:54:01,496][__main__][INFO] - Iteration 707 took 53s (10.97% Gen, 86.38% Train). Generation: 5s, Training: 46s. Estimated remaining time: 4h 4m 56s. Estimated total time: 14h 52m 52s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 17s, 500 more iterations: 7h 26m 26s. [2026-03-26 00:54:01,499][__main__][INFO] - Starting iteration 707. [2026-03-26 00:54:01,505][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:54:01,506][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:54:07,836][__main__][INFO] - Number of regex retries in iteration 707: 0 [2026-03-26 00:54:07,837][__main__][INFO] - agents played in iteration 707 are Bob, Alice [2026-03-26 00:54:08,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:54:08,504][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:54:08,505][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:54:08,505][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:54:09,167][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:54:09,782][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:54:10,442][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:54:11,100][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:54:11,759][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:54:12,417][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:54:13,075][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:54:13,733][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:54:14,392][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:54:15,050][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:54:15,708][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:54:16,366][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:54:17,025][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:54:17,683][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:54:18,341][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:54:19,000][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:54:19,658][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:54:20,317][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:54:20,975][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:54:21,634][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:54:22,293][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:54:22,952][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:54:23,613][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:54:24,270][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:54:24,928][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:54:25,587][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:54:26,245][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:54:26,904][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:54:27,562][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:54:28,221][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:54:28,880][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:54:29,538][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:54:30,197][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:54:30,855][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:54:31,514][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:54:32,173][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:54:32,831][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:54:33,490][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:54:34,149][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:54:34,807][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:54:35,466][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:54:36,125][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:54:36,785][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:54:37,443][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:54:38,101][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:54:38,760][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:54:39,419][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:54:40,078][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:54:40,978][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:54:41,639][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:54:42,298][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:54:42,957][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:54:43,616][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:54:44,276][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:54:44,934][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:54:45,593][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:54:46,253][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:54:46,912][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:54:47,571][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:54:48,229][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:54:48,888][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:54:49,547][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:54:50,206][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:54:50,864][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:54:51,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:54:52,221][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:54:53,550][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:54:53,553][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:54:53,554][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:54:54,851][__main__][INFO] - Iteration 708 took 53s (11.87% Gen, 85.69% Train). Generation: 6s, Training: 45s. Estimated remaining time: 4h 0m 18s. Estimated total time: 14h 49m 8s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 54s, 500 more iterations: 7h 24m 34s. [2026-03-26 00:54:54,854][__main__][INFO] - Starting iteration 708. [2026-03-26 00:54:54,859][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:54:54,860][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:54:59,562][__main__][INFO] - Number of regex retries in iteration 708: 0 [2026-03-26 00:54:59,563][__main__][INFO] - agents played in iteration 708 are Bob, Alice [2026-03-26 00:55:00,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:55:00,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:55:00,137][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:55:00,138][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:55:00,856][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:55:01,471][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:55:02,131][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:55:02,790][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:55:03,448][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:55:04,107][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:55:04,765][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:55:05,423][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:55:06,082][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:55:06,741][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:55:07,399][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:55:08,058][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:55:08,716][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:55:09,375][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:55:10,034][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:55:10,692][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:55:11,350][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:55:12,009][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:55:12,668][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:55:13,327][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:55:13,986][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:55:14,644][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:55:15,303][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:55:15,961][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:55:16,620][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:55:17,278][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:55:17,937][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:55:18,595][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:55:19,254][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:55:19,913][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:55:20,572][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:55:21,230][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:55:21,889][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:55:22,548][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:55:23,208][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:55:23,867][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:55:24,526][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:55:25,185][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:55:25,844][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:55:26,503][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:55:27,162][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:55:27,821][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:55:28,481][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:55:29,141][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:55:29,801][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:55:30,459][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:55:31,119][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:55:31,777][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:55:32,728][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:55:33,389][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:55:34,048][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:55:34,707][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:55:35,366][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:55:36,026][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:55:36,685][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:55:37,344][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:55:38,003][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:55:38,661][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:55:39,320][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:55:39,979][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:55:40,638][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:55:41,296][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:55:41,955][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:55:42,614][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:55:43,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:55:43,988][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:55:45,349][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:55:45,351][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:55:45,353][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:55:46,966][__main__][INFO] - Iteration 709 took 52s (9.03% Gen, 87.87% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 38m 49s. Estimated total time: 14h 28m 30s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 51s, 500 more iterations: 7h 14m 15s. [2026-03-26 00:55:46,969][__main__][INFO] - Starting iteration 709. [2026-03-26 00:55:46,975][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:55:46,975][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:55:51,766][__main__][INFO] - Number of regex retries in iteration 709: 0 [2026-03-26 00:55:51,767][__main__][INFO] - agents played in iteration 709 are Bob, Alice [2026-03-26 00:55:52,374][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:55:52,436][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:55:52,437][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:55:52,438][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:55:53,132][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:55:53,738][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:55:54,398][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:55:55,056][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:55:55,715][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:55:56,373][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:55:57,032][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:55:57,690][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:55:58,350][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:55:59,010][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:55:59,669][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:56:00,327][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:56:00,986][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:56:01,645][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:56:02,303][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:56:02,961][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:56:03,620][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:56:04,279][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:56:04,937][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:56:05,596][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:56:06,255][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:56:06,913][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:56:07,572][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:56:08,231][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:56:08,890][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:56:09,548][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:56:10,207][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:56:10,866][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:56:11,525][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:56:12,185][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:56:12,843][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:56:13,503][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:56:14,161][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:56:14,821][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:56:15,479][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:56:16,138][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:56:16,796][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:56:17,455][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:56:18,113][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:56:18,772][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:56:19,430][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:56:20,088][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:56:20,747][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:56:21,406][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:56:22,065][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:56:22,723][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:56:23,381][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:56:24,040][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:56:24,954][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:56:25,614][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:56:26,272][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:56:26,931][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:56:27,591][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:56:28,249][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:56:28,911][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:56:29,570][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:56:30,228][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:56:30,888][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:56:31,547][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:56:32,205][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:56:32,864][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:56:33,522][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:56:34,181][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:56:34,839][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:56:35,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:56:36,242][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:56:37,599][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:56:37,603][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:56:37,604][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:56:39,211][__main__][INFO] - Iteration 710 took 52s (9.17% Gen, 87.74% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 40m 5s. Estimated total time: 14h 30m 39s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 3s, 500 more iterations: 7h 15m 19s. [2026-03-26 00:56:39,214][__main__][INFO] - Starting iteration 710. [2026-03-26 00:56:39,218][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:56:39,218][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:56:46,027][__main__][INFO] - Number of regex retries in iteration 710: 0 [2026-03-26 00:56:46,029][__main__][INFO] - agents played in iteration 710 are Bob, Alice [2026-03-26 00:56:46,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:56:46,676][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:56:46,677][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:56:46,678][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:56:47,349][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:56:47,964][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:56:48,624][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:56:49,283][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:56:49,941][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:56:50,600][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:56:51,258][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:56:51,918][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:56:52,577][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:56:53,236][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:56:53,894][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:56:54,552][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:56:55,211][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:56:55,869][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:56:56,528][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:56:57,187][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:56:57,844][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:56:58,503][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:56:59,164][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:56:59,822][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:57:00,480][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:57:01,140][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:57:01,799][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:57:02,456][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:57:03,115][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:57:03,773][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:57:04,432][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:57:05,090][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:57:05,748][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:57:06,407][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:57:07,066][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:57:07,725][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:57:08,383][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:57:09,042][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:57:09,700][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:57:10,358][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:57:11,017][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:57:11,675][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:57:12,334][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:57:12,992][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:57:13,650][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:57:14,308][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:57:14,967][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:57:15,625][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:57:16,284][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:57:16,942][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:57:17,600][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:57:18,260][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:57:19,194][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:57:19,852][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:57:20,513][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:57:21,193][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:57:21,852][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:57:22,511][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:57:23,170][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:57:23,828][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:57:24,486][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:57:25,144][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:57:25,804][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:57:26,462][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:57:27,120][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:57:27,779][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:57:28,439][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:57:29,098][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:57:29,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:57:30,484][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:57:31,823][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:57:31,826][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:57:31,827][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:57:33,452][__main__][INFO] - Iteration 711 took 54s (12.56% Gen, 84.44% Train). Generation: 6s, Training: 45s. Estimated remaining time: 4h 12m 28s. Estimated total time: 15h 3m 56s. Time estimates for 10 more iterations: 9m 2s, 100 more iterations: 1h 30m 23s, 500 more iterations: 7h 31m 58s. [2026-03-26 00:57:33,454][__main__][INFO] - Starting iteration 711. [2026-03-26 00:57:33,462][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:57:33,463][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:57:34,479][mllm.models.large_language_model_local][WARNING] - Response \ did not match regex: (|), retry 1/1 [2026-03-26 00:57:39,550][__main__][INFO] - Number of regex retries in iteration 711: 1 [2026-03-26 00:57:39,552][__main__][INFO] - agents played in iteration 711 are Bob, Alice [2026-03-26 00:57:40,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:57:40,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:57:40,561][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:57:40,562][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2026-03-26 00:57:41,233][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:57:41,851][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:57:42,511][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:57:43,171][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:57:43,829][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:57:44,487][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:57:45,145][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:57:45,803][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:57:46,461][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:57:47,120][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:57:47,778][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-26 00:57:48,436][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:57:49,094][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:57:49,753][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:57:50,411][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:57:51,069][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:57:51,728][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:57:52,387][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:57:53,046][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:57:53,705][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:57:54,363][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 256 [2026-03-26 00:57:55,022][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:57:55,680][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:57:56,340][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:57:56,999][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:57:57,658][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:57:58,317][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:57:58,977][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:57:59,636][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:58:00,295][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:58:00,953][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:58:01,612][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-26 00:58:02,270][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:58:02,929][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:58:03,587][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:58:04,246][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:58:04,904][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:58:05,563][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:58:06,222][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:58:06,880][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:58:07,539][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 
00:58:08,198][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:58:08,856][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:58:09,516][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:58:10,175][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:58:10,833][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:58:11,492][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:58:12,151][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:58:13,110][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:58:13,770][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:58:14,429][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:58:15,088][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 00:58:15,747][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:58:16,405][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:58:17,064][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:58:17,722][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:58:18,381][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:58:19,041][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:58:19,700][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:58:20,359][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:58:21,017][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:58:21,676][mllm.training.trainer_common][INFO] - 
Processing mini-batch 244 of 256 [2026-03-26 00:58:22,335][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:58:22,994][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:58:23,653][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-26 00:58:24,363][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:58:25,846][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:58:25,849][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:58:25,850][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:58:27,226][__main__][INFO] - Iteration 712 took 53s (11.33% Gen, 86.11% Train). Generation: 6s, Training: 46s. Estimated remaining time: 4h 3m 43s. Estimated total time: 14h 56m 5s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 36s, 500 more iterations: 7h 28m 2s. [2026-03-26 00:58:27,228][__main__][INFO] - Starting iteration 712. [2026-03-26 00:58:27,234][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:58:27,234][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:58:32,252][__main__][INFO] - Number of regex retries in iteration 712: 0 [2026-03-26 00:58:32,253][__main__][INFO] - agents played in iteration 712 are Bob, Alice [2026-03-26 00:58:32,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:58:32,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:58:32,833][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:58:32,834][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:58:33,501][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:58:34,105][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:58:34,765][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:58:35,424][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:58:36,082][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:58:36,740][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:58:37,398][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:58:38,057][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:58:38,715][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:58:39,374][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:58:40,033][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:58:40,691][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:58:41,350][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:58:42,008][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:58:42,667][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:58:43,326][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:58:43,984][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:58:44,643][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:58:45,301][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:58:45,960][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:58:46,619][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:58:47,277][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:58:47,935][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:58:48,594][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:58:49,252][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:58:49,911][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:58:50,569][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:58:51,228][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:58:51,886][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:58:52,545][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:58:53,204][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:58:53,863][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:58:54,521][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:58:55,180][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:58:55,840][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:58:56,499][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:58:57,157][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:58:57,815][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:58:58,475][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:58:59,136][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:58:59,795][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:59:00,454][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:59:01,113][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:59:01,772][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:59:02,431][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:59:03,090][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:59:03,748][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:59:04,407][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:59:05,338][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:59:05,998][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:59:06,658][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:59:07,317][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:59:07,975][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:59:08,634][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:59:09,293][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:59:09,952][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:59:10,611][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:59:11,270][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:59:11,928][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:59:12,587][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:59:13,245][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:59:13,904][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:59:14,562][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:59:15,221][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:59:15,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:59:16,580][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 00:59:17,989][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:59:17,992][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:59:17,993][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:59:19,375][__main__][INFO] - Iteration 713 took 52s (9.62% Gen, 87.72% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 35m 50s. Estimated total time: 14h 29m 4s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 54s, 500 more iterations: 7h 14m 32s. [2026-03-26 00:59:19,377][__main__][INFO] - Starting iteration 713. [2026-03-26 00:59:19,380][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 00:59:19,381][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:59:24,171][__main__][INFO] - Number of regex retries in iteration 713: 0 [2026-03-26 00:59:24,172][__main__][INFO] - agents played in iteration 713 are Bob, Alice [2026-03-26 00:59:24,763][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:59:24,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 00:59:24,826][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:59:24,827][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:59:25,520][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:59:26,134][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:59:26,795][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:59:27,453][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:59:28,111][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:59:28,770][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:59:29,428][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:59:30,086][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:59:30,744][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:59:31,402][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:59:32,060][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:59:32,719][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:59:33,377][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:59:34,035][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:59:34,694][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:59:35,352][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:59:36,011][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:59:36,669][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:59:37,328][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:59:37,986][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:59:38,644][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:59:39,303][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:59:39,961][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:59:40,619][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:59:41,279][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:59:41,937][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:59:42,596][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:59:43,254][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:59:43,913][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:59:44,571][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:59:45,230][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:59:45,889][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:59:46,547][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:59:47,206][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:59:47,864][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:59:48,523][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:59:49,181][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:59:49,839][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:59:50,498][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:59:51,157][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:59:51,816][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:59:52,475][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:59:53,133][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:59:53,792][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:59:54,450][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:59:55,108][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:59:55,767][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:59:56,425][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:59:57,335][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:59:57,995][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:59:58,654][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:59:59,313][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:59:59,971][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:00:00,630][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:00:01,288][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:00:01,947][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:00:02,605][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:00:03,264][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:00:03,923][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:00:04,581][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:00:05,239][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:00:05,899][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:00:06,558][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:00:07,217][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:00:07,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:00:08,650][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:00:10,079][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:00:10,082][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:00:10,084][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:00:11,656][__main__][INFO] - Iteration 714 took 52s (9.17% Gen, 87.82% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 37m 11s. Estimated total time: 14h 31m 17s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 7s, 500 more iterations: 7h 15m 38s. [2026-03-26 01:00:11,659][__main__][INFO] - Starting iteration 714. [2026-03-26 01:00:11,663][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:00:11,664][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:00:17,280][__main__][INFO] - Number of regex retries in iteration 714: 0 [2026-03-26 01:00:17,282][__main__][INFO] - agents played in iteration 714 are Bob, Alice [2026-03-26 01:00:17,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:00:17,835][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:00:17,836][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:00:17,837][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:00:18,500][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:00:19,109][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:00:19,769][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:00:20,426][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:00:21,085][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:00:21,744][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:00:22,402][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:00:23,060][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:00:23,718][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:00:24,377][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:00:25,035][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:00:25,693][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:00:26,352][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:00:27,010][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:00:27,669][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:00:28,328][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:00:28,988][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:00:29,648][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:00:30,307][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:00:30,964][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:00:31,624][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:00:32,283][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:00:32,941][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:00:33,600][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:00:34,258][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:00:34,917][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:00:35,576][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:00:36,235][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:00:36,894][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:00:37,552][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:00:38,210][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:00:38,869][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:00:39,527][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:00:40,185][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:00:40,844][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:00:41,503][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:00:42,161][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:00:42,820][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:00:43,479][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:00:44,138][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:00:44,796][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:00:45,455][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:00:46,113][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:00:46,772][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:00:47,430][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:00:48,089][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:00:48,748][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:00:49,406][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:00:50,400][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:00:51,060][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:00:51,718][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:00:52,376][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:00:53,035][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:00:53,692][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:00:54,349][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:00:55,008][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:00:55,665][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:00:56,322][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:00:56,981][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:00:57,638][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:00:58,297][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:00:58,954][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:00:59,612][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:01:00,269][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:01:00,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:01:01,696][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:01:03,028][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:01:03,030][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:01:03,031][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:01:04,463][__main__][INFO] - Iteration 715 took 52s (10.64% Gen, 86.65% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 45m 2s. Estimated total time: 14h 40m 1s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 0s, 500 more iterations: 7h 20m 0s. [2026-03-26 01:01:04,465][__main__][INFO] - Starting iteration 715. [2026-03-26 01:01:04,469][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:01:04,470][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:01:09,272][__main__][INFO] - Number of regex retries in iteration 715: 0 [2026-03-26 01:01:09,273][__main__][INFO] - agents played in iteration 715 are Bob, Alice [2026-03-26 01:01:09,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:01:09,870][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:01:09,870][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:01:09,871][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:01:10,566][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:01:11,185][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:01:11,844][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:01:12,501][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:01:13,158][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:01:13,815][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:01:14,472][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:01:15,129][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:01:15,786][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:01:16,443][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:01:17,100][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:01:17,758][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:01:18,415][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:01:19,073][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:01:19,730][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:01:20,388][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:01:21,044][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:01:21,702][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:01:22,359][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:01:23,017][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:01:23,674][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:01:24,331][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:01:24,988][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:01:25,645][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:01:26,303][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:01:26,960][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:01:27,617][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:01:28,275][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:01:28,932][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:01:29,589][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:01:30,247][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:01:30,904][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:01:31,561][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:01:32,219][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:01:32,876][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:01:33,533][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:01:34,191][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:01:34,848][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:01:35,505][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:01:36,164][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:01:36,822][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:01:37,479][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:01:38,136][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:01:38,793][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:01:39,451][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:01:40,108][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:01:40,766][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:01:41,423][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:01:42,376][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:01:43,034][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:01:43,691][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:01:44,349][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:01:45,006][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:01:45,664][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:01:46,321][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:01:46,979][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:01:47,636][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:01:48,293][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:01:48,951][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:01:49,608][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:01:50,265][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:01:50,922][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:01:51,579][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:01:52,237][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:01:52,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:01:53,622][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:01:54,960][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:01:54,963][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:01:54,964][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:01:56,412][__main__][INFO] - Iteration 716 took 51s (9.25% Gen, 87.96% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 29m 53s. Estimated total time: 14h 25m 44s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 34s, 500 more iterations: 7h 12m 52s. [2026-03-26 01:01:56,415][__main__][INFO] - Starting iteration 716. [2026-03-26 01:01:56,419][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:01:56,419][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:02:02,063][__main__][INFO] - Number of regex retries in iteration 716: 0 [2026-03-26 01:02:02,064][__main__][INFO] - agents played in iteration 716 are Bob, Alice [2026-03-26 01:02:03,100][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:02:03,162][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:02:03,163][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:02:03,164][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:02:03,857][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:02:04,490][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:02:05,149][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:02:05,807][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:02:06,463][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:02:07,120][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:02:07,778][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:02:08,435][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:02:09,091][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:02:09,749][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:02:10,406][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:02:11,062][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:02:11,719][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:02:12,376][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:02:13,034][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:02:13,690][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:02:14,348][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:02:15,005][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:02:15,661][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:02:16,318][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:02:16,976][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:02:17,633][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:02:18,290][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:02:18,948][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:02:19,605][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:02:20,262][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:02:20,919][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:02:21,576][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:02:22,234][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:02:22,891][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:02:23,548][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:02:24,205][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:02:24,863][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:02:25,520][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:02:26,177][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:02:26,835][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:02:27,492][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:02:28,149][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:02:28,806][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:02:29,464][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:02:30,121][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:02:30,779][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:02:31,436][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:02:32,093][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:02:32,751][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:02:33,408][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:02:34,065][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:02:34,723][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:02:35,607][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:02:36,265][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:02:36,923][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:02:37,580][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:02:38,237][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:02:38,895][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:02:39,553][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:02:40,211][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:02:40,869][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:02:41,527][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:02:42,185][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:02:42,843][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:02:43,501][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:02:44,159][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:02:44,816][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:02:45,473][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:02:46,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:02:46,844][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:02:48,183][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:02:48,186][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:02:48,187][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:02:49,650][__main__][INFO] - Iteration 717 took 53s (10.60% Gen, 86.64% Train). Generation: 5s, Training: 46s. Estimated remaining time: 3h 50m 28s. Estimated total time: 14h 47m 12s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 43s, 500 more iterations: 7h 23m 36s. [2026-03-26 01:02:49,652][__main__][INFO] - Starting iteration 717. [2026-03-26 01:02:49,656][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:02:49,657][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:02:54,617][__main__][INFO] - Number of regex retries in iteration 717: 0 [2026-03-26 01:02:54,619][__main__][INFO] - agents played in iteration 717 are Bob, Alice [2026-03-26 01:02:55,127][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:02:55,188][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:02:55,189][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:02:55,190][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:02:55,859][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:02:56,465][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:02:57,124][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:02:57,780][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:02:58,438][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:02:59,095][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:02:59,752][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:03:00,409][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:03:01,066][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:03:01,723][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:03:02,380][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:03:03,037][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:03:03,694][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:03:04,351][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:03:05,008][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:03:05,666][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:03:06,323][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:03:06,980][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:03:07,638][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:03:08,295][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:03:08,952][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:03:09,609][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:03:10,267][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:03:10,923][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:03:11,581][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:03:12,239][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:03:12,897][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:03:14,784][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:03:15,441][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:03:16,098][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:03:16,754][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:03:17,411][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:03:18,069][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:03:18,726][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:03:19,383][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:03:20,040][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:03:20,697][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:03:21,354][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:03:22,012][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:03:22,669][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:03:23,326][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:03:23,984][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:03:24,641][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:03:25,298][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:03:25,956][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:03:26,613][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:03:27,271][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:03:27,928][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:03:28,857][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:03:29,514][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:03:30,172][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:03:30,830][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:03:31,487][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:03:32,144][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:03:32,802][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:03:33,459][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:03:34,116][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:03:34,774][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:03:35,432][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:03:36,089][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:03:36,746][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:03:37,404][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:03:38,061][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:03:38,720][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:03:39,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:03:40,074][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-26 01:03:41,423][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:03:41,426][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:03:41,427][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:03:42,965][__main__][INFO] - Iteration 718 took 53s (9.31% Gen, 87.80% Train). Generation: 4s, Training: 46s. Estimated remaining time: 3h 50m 53s. Estimated total time: 14h 48m 30s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 51s, 500 more iterations: 7h 24m 15s. [2026-03-26 01:03:42,968][__main__][INFO] - Starting iteration 718. [2026-03-26 01:03:42,971][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:03:42,972][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:03:47,834][__main__][INFO] - Number of regex retries in iteration 718: 0 [2026-03-26 01:03:47,835][__main__][INFO] - agents played in iteration 718 are Bob, Alice [2026-03-26 01:03:48,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:03:48,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:03:48,406][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:03:48,407][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:03:49,065][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:03:49,678][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:03:50,336][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:03:50,993][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:03:51,650][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:03:52,307][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:03:52,964][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:03:53,621][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:03:54,278][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:03:54,935][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:03:55,592][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:03:56,249][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:03:56,906][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:03:57,563][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:03:58,221][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:03:58,879][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:03:59,536][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:04:00,194][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:04:00,852][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:04:01,509][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:04:02,166][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:04:02,824][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:04:03,481][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:04:04,139][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:04:04,796][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:04:05,455][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:04:06,112][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:04:06,770][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:04:07,427][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:04:08,085][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:04:08,742][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:04:09,400][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:04:10,058][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:04:10,715][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:04:11,373][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:04:12,031][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:04:12,689][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:04:13,347][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:04:14,004][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:04:14,662][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:04:15,319][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:04:15,977][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:04:16,634][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:04:17,291][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:04:17,949][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:04:18,607][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:04:19,264][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:04:19,922][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:04:20,808][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:04:21,467][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:04:22,124][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:04:22,782][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:04:23,439][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:04:24,098][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:04:24,755][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:04:25,413][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:04:26,070][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:04:26,727][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:04:27,386][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:04:28,043][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:04:28,701][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:04:29,358][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:04:30,016][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:04:30,675][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:04:31,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:04:32,126][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:04:33,457][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:04:33,462][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:04:33,463][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:04:34,883][__main__][INFO] - Iteration 719 took 51s (9.37% Gen, 87.89% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 26m 44s. Estimated total time: 14h 25m 13s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 31s, 500 more iterations: 7h 12m 36s. [2026-03-26 01:04:34,886][__main__][INFO] - Starting iteration 719. [2026-03-26 01:04:34,890][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:04:34,891][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:04:39,717][__main__][INFO] - Number of regex retries in iteration 719: 0 [2026-03-26 01:04:39,718][__main__][INFO] - agents played in iteration 719 are Bob, Alice [2026-03-26 01:04:40,317][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:04:40,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:04:40,379][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:04:40,380][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:04:41,075][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:04:41,684][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:04:42,344][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:04:43,003][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:04:43,661][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:04:44,319][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:04:44,977][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:04:45,636][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:04:46,294][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:04:46,952][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:04:47,611][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:04:48,269][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:04:48,927][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:04:49,586][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:04:50,244][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:04:50,902][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:04:51,561][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:04:52,220][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:04:52,878][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:04:53,537][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:04:54,196][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:04:54,854][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:04:55,513][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:04:56,171][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:04:56,830][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:04:57,488][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:04:58,146][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:04:58,805][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:04:59,464][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:05:00,122][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:05:00,780][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:05:01,439][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:05:02,098][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:05:02,756][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:05:03,414][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:05:04,073][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:05:04,731][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:05:05,390][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:05:06,049][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:05:06,708][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:05:07,367][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:05:08,025][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:05:08,683][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:05:09,342][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:05:10,001][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:05:10,660][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:05:11,318][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:05:11,977][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:05:12,897][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:05:13,555][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:05:14,213][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:05:14,871][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:05:15,529][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:05:16,186][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:05:16,843][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:05:17,500][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:05:18,158][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:05:18,816][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:05:19,474][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:05:20,131][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:05:20,789][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:05:21,446][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:05:22,104][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:05:22,762][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:05:23,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:05:24,124][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:05:25,481][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:05:25,484][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:05:25,485][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:05:26,778][__main__][INFO] - Iteration 720 took 51s (9.30% Gen, 88.20% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 25m 29s. Estimated total time: 14h 24m 50s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 29s, 500 more iterations: 7h 12m 25s. [2026-03-26 01:05:26,781][__main__][INFO] - Starting iteration 720. [2026-03-26 01:05:26,785][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:05:26,786][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:05:32,315][__main__][INFO] - Number of regex retries in iteration 720: 0 [2026-03-26 01:05:32,317][__main__][INFO] - agents played in iteration 720 are Bob, Alice [2026-03-26 01:05:33,422][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:05:33,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:05:33,486][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:05:33,487][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:05:34,143][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:05:34,761][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:05:35,421][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:05:36,079][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:05:36,737][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:05:37,396][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:05:38,055][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:05:38,714][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:05:39,373][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:05:40,032][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:05:40,690][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:05:41,349][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:05:42,008][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:05:42,666][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:05:43,324][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:05:43,983][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:05:44,642][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:05:45,300][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:05:45,959][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:05:46,617][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:05:47,276][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:05:47,934][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:05:48,593][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:05:49,252][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:05:49,911][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:05:50,568][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:05:51,227][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:05:51,885][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:05:52,543][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:05:53,200][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:05:53,858][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:05:54,515][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:05:55,172][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:05:55,830][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:05:56,488][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:05:57,145][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:05:57,803][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:05:58,460][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:05:59,118][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:05:59,775][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:06:00,433][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:06:01,090][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:06:01,748][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:06:02,406][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:06:03,063][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:06:03,721][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:06:04,378][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:06:05,036][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:06:05,944][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:06:06,602][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:06:07,259][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:06:07,916][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:06:08,574][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:06:09,232][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:06:09,890][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:06:10,547][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:06:11,205][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:06:11,863][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:06:12,521][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:06:13,178][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:06:13,836][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:06:14,493][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:06:15,151][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:06:15,809][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:06:16,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:06:17,229][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:06:18,554][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:06:18,557][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:06:18,559][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:06:19,983][__main__][INFO] - Iteration 721 took 53s (10.39% Gen, 86.92% Train). Generation: 5s, Training: 46s. Estimated remaining time: 3h 46m 25s. Estimated total time: 14h 46m 40s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 40s, 500 more iterations: 7h 23m 20s. [2026-03-26 01:06:19,986][__main__][INFO] - Starting iteration 721. [2026-03-26 01:06:19,991][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:06:19,991][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:06:25,668][__main__][INFO] - Number of regex retries in iteration 721: 0 [2026-03-26 01:06:25,670][__main__][INFO] - agents played in iteration 721 are Bob, Alice [2026-03-26 01:06:26,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:06:26,274][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:06:26,275][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:06:26,276][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:06:26,985][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:06:27,588][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:06:28,247][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:06:28,904][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:06:29,561][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:06:30,218][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:06:30,875][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:06:31,532][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:06:32,189][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:06:32,845][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:06:33,503][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:06:34,160][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:06:34,817][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:06:35,475][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:06:36,132][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:06:36,789][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:06:37,447][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:06:38,104][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:06:38,762][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:06:39,420][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:06:40,077][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:06:40,734][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:06:41,392][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:06:42,049][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:06:42,707][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:06:43,364][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:06:44,022][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:06:44,679][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:06:45,336][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:06:45,993][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:06:46,652][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:06:47,309][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:06:47,966][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:06:48,624][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:06:49,282][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:06:49,939][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:06:50,596][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:06:51,254][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:06:51,912][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:06:52,569][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:06:53,227][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:06:53,884][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:06:54,541][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:06:55,199][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:06:55,856][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:06:56,514][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:06:57,172][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:06:57,832][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:06:58,748][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:06:59,408][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:07:00,066][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:07:00,724][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:07:01,382][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:07:02,040][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:07:02,698][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:07:03,356][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:07:04,014][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:07:04,671][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:07:05,329][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:07:05,987][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:07:06,644][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:07:07,302][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:07:07,960][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:07:08,617][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:07:09,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:07:09,972][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:07:11,354][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:07:11,357][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:07:11,358][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:07:12,790][__main__][INFO] - Iteration 722 took 52s (10.75% Gen, 86.53% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 38m 55s. Estimated total time: 14h 40m 2s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 0s, 500 more iterations: 7h 20m 1s. [2026-03-26 01:07:12,793][__main__][INFO] - Starting iteration 722. [2026-03-26 01:07:12,797][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:07:12,797][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:07:17,495][__main__][INFO] - Number of regex retries in iteration 722: 0 [2026-03-26 01:07:17,496][__main__][INFO] - agents played in iteration 722 are Bob, Alice [2026-03-26 01:07:18,115][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:07:18,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:07:18,178][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:07:18,179][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:07:18,871][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:07:19,478][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:07:20,137][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:07:20,794][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:07:21,451][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:07:22,108][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:07:22,765][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:07:23,423][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:07:24,080][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:07:24,737][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:07:25,394][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:07:26,051][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:07:26,708][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:07:27,365][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:07:28,023][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:07:28,681][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:07:29,338][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:07:29,995][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:07:30,652][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:07:31,310][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:07:31,967][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:07:32,624][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:07:33,282][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:07:33,939][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:07:34,596][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:07:35,254][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:07:35,911][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:07:36,569][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:07:37,226][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:07:37,883][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:07:38,540][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:07:39,198][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:07:39,855][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:07:40,513][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:07:41,170][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:07:41,828][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:07:42,486][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:07:43,144][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:07:43,803][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:07:44,460][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:07:45,117][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:07:45,776][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:07:46,433][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:07:47,091][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:07:47,749][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:07:49,664][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:07:50,322][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:07:50,979][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:07:51,876][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:07:52,534][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:07:53,191][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:07:53,848][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:07:54,505][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:07:55,163][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:07:55,821][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:07:56,478][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:07:57,136][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:07:57,793][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:07:58,451][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:07:59,110][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:07:59,767][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:08:00,424][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:08:01,082][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:08:01,740][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:08:02,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:08:03,103][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-26 01:08:04,439][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:08:04,442][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:08:04,443][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:08:05,904][__main__][INFO] - Iteration 723 took 53s (8.85% Gen, 88.40% Train). Generation: 4s, Training: 46s. Estimated remaining time: 3h 43m 8s. Estimated total time: 14h 45m 8s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 30s, 500 more iterations: 7h 22m 34s. [2026-03-26 01:08:05,906][__main__][INFO] - Starting iteration 723. [2026-03-26 01:08:05,910][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:08:05,911][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:08:12,840][__main__][INFO] - Number of regex retries in iteration 723: 0 [2026-03-26 01:08:12,842][__main__][INFO] - agents played in iteration 723 are Bob, Alice [2026-03-26 01:08:13,350][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:08:13,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:08:13,412][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:08:13,413][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:08:14,075][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:08:14,687][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:08:15,346][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:08:16,002][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:08:16,660][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:08:17,316][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:08:17,974][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:08:18,630][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:08:19,287][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:08:19,944][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:08:20,602][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:08:21,259][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:08:21,916][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:08:22,574][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:08:23,231][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:08:23,888][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:08:24,546][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:08:25,204][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:08:25,862][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:08:26,520][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:08:27,177][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:08:27,834][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:08:28,493][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:08:29,150][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:08:29,807][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:08:30,464][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:08:31,122][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:08:31,779][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:08:32,437][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:08:33,095][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:08:33,752][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:08:34,410][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:08:35,067][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:08:35,725][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:08:36,383][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:08:37,040][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:08:37,698][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:08:38,356][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:08:39,013][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:08:39,670][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:08:40,329][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:08:40,986][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:08:41,643][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:08:42,301][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:08:42,958][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:08:43,616][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:08:44,274][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:08:44,931][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:08:45,853][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:08:46,511][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:08:47,169][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:08:47,826][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:08:48,484][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:08:49,141][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:08:49,799][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:08:50,456][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:08:51,113][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:08:51,771][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:08:52,428][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:08:53,086][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:08:53,743][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:08:54,401][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:08:55,058][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:08:55,716][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:08:56,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:08:57,070][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:08:58,397][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:08:58,400][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:08:58,402][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:08:59,729][__main__][INFO] - Iteration 724 took 53s (12.88% Gen, 84.65% Train). Generation: 6s, Training: 45s. Estimated remaining time: 3h 54m 6s. Estimated total time: 14h 57m 1s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 42s, 500 more iterations: 7h 28m 30s. [2026-03-26 01:08:59,732][__main__][INFO] - Starting iteration 724. [2026-03-26 01:08:59,737][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:08:59,737][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:09:04,549][__main__][INFO] - Number of regex retries in iteration 724: 0 [2026-03-26 01:09:04,550][__main__][INFO] - agents played in iteration 724 are Bob, Alice [2026-03-26 01:09:05,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:09:05,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:09:05,235][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:09:05,235][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:09:05,902][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:09:06,508][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:09:07,167][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:09:07,824][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:09:08,482][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:09:09,139][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:09:09,796][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:09:10,453][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:09:11,111][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:09:11,769][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:09:12,426][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:09:13,083][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:09:13,741][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:09:14,398][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:09:15,056][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:09:15,713][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:09:16,370][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:09:17,027][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:09:17,685][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:09:18,342][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:09:18,999][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:09:19,657][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:09:20,314][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:09:20,971][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:09:21,629][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:09:22,287][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:09:22,944][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:09:23,601][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:09:24,259][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:09:24,916][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:09:25,573][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:09:26,231][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:09:26,888][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:09:27,546][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:09:28,203][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:09:28,861][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:09:29,520][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:09:30,177][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:09:30,835][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:09:31,493][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:09:32,150][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:09:32,808][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:09:33,465][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:09:34,123][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:09:34,780][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:09:35,438][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:09:36,096][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:09:36,754][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:09:37,706][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:09:38,364][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:09:39,022][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:09:39,679][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:09:40,337][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:09:40,995][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:09:41,652][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:09:42,309][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:09:42,967][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:09:43,625][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:09:44,282][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:09:44,940][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:09:45,598][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:09:46,255][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:09:46,913][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:09:47,570][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:09:48,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:09:48,920][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:09:50,304][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:09:50,308][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:09:50,309][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:09:51,543][__main__][INFO] - Iteration 725 took 51s (9.29% Gen, 88.32% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 19m 42s. Estimated total time: 14h 23m 28s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 20s, 500 more iterations: 7h 11m 44s. [2026-03-26 01:09:51,545][__main__][INFO] - Starting iteration 725. [2026-03-26 01:09:51,549][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:09:51,550][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:09:57,511][__main__][INFO] - Number of regex retries in iteration 725: 0 [2026-03-26 01:09:57,512][__main__][INFO] - agents played in iteration 725 are Bob, Alice [2026-03-26 01:09:58,422][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:09:58,486][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:09:58,487][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:09:58,487][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:09:59,154][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:09:59,771][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:10:00,431][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:10:01,088][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:10:01,746][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:10:02,404][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:10:03,062][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:10:03,720][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:10:04,379][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:10:05,036][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:10:05,695][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:10:06,353][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:10:07,011][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:10:07,669][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:10:08,327][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:10:08,985][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:10:09,643][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:10:10,302][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:10:10,961][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:10:11,619][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:10:12,277][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:10:12,936][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:10:13,594][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:10:14,252][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:10:14,911][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:10:15,569][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:10:16,227][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:10:16,886][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:10:17,544][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:10:19,184][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:10:19,842][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:10:20,499][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:10:21,156][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:10:21,814][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:10:22,471][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:10:23,129][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:10:23,787][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:10:24,445][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:10:25,102][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:10:25,760][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:10:26,417][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:10:27,074][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:10:27,732][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:10:28,390][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:10:29,047][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:10:29,705][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:10:30,362][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:10:31,020][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:10:31,941][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:10:32,600][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:10:33,258][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:10:33,916][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:10:34,574][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:10:35,233][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:10:35,891][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:10:36,548][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:10:37,206][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:10:37,863][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:10:38,521][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:10:39,179][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:10:39,837][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:10:40,495][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:10:41,153][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:10:41,810][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:10:42,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:10:43,159][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-26 01:10:44,498][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:10:44,502][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:10:44,503][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:10:45,905][__main__][INFO] - Iteration 726 took 54s (10.97% Gen, 86.45% Train). Generation: 5s, Training: 46s. Estimated remaining time: 4h 1m 17s. Estimated total time: 15h 5m 57s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 35s, 500 more iterations: 7h 32m 58s. [2026-03-26 01:10:45,908][__main__][INFO] - Starting iteration 726. [2026-03-26 01:10:45,911][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:10:45,912][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:10:51,189][__main__][INFO] - Number of regex retries in iteration 726: 0 [2026-03-26 01:10:51,191][__main__][INFO] - agents played in iteration 726 are Bob, Alice [2026-03-26 01:10:51,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:10:51,856][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:10:51,857][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:10:51,858][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:10:52,535][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:10:53,146][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:10:53,806][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:10:54,464][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:10:55,122][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:10:55,780][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:10:56,438][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:10:57,096][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:10:57,755][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:10:58,413][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:10:59,072][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:10:59,730][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:11:00,388][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:11:01,046][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:11:01,704][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:11:02,362][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:11:03,021][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:11:03,679][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:11:04,337][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:11:04,996][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:11:05,654][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:11:06,312][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:11:06,971][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:11:07,629][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:11:08,287][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:11:08,946][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:11:09,604][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:11:10,262][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:11:10,921][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:11:11,580][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:11:12,238][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:11:12,896][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:11:13,554][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:11:14,213][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:11:14,872][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:11:15,530][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:11:16,190][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:11:16,848][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:11:17,506][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:11:18,164][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:11:18,824][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:11:19,482][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:11:20,141][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:11:20,799][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:11:21,458][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:11:22,117][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:11:22,775][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:11:23,434][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:11:24,358][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:11:25,017][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:11:25,675][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:11:26,332][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:11:26,990][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:11:27,647][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:11:28,305][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:11:28,963][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:11:29,620][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:11:30,277][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:11:30,935][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:11:31,592][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:11:32,250][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:11:32,908][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:11:33,565][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:11:34,222][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:11:34,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:11:35,627][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:11:36,974][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:11:36,977][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:11:36,978][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:11:38,483][__main__][INFO] - Iteration 727 took 52s (10.04% Gen, 87.09% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 30m 40s. Estimated total time: 14h 36m 13s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 37s, 500 more iterations: 7h 18m 6s. [2026-03-26 01:11:38,485][__main__][INFO] - Starting iteration 727. [2026-03-26 01:11:38,489][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:11:38,490][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:11:43,202][__main__][INFO] - Number of regex retries in iteration 727: 0 [2026-03-26 01:11:43,204][__main__][INFO] - agents played in iteration 727 are Bob, Alice [2026-03-26 01:11:43,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:11:43,764][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:11:43,764][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:11:43,765][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:11:44,436][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:11:45,044][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:11:45,703][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:11:46,360][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:11:47,017][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:11:47,674][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:11:48,331][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:11:48,988][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:11:49,645][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:11:50,302][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:11:50,959][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:11:51,616][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:11:52,274][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:11:52,931][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:11:53,588][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:11:54,246][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:11:54,903][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:11:55,560][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:11:56,217][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:11:56,874][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:11:57,532][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:11:58,190][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:11:58,848][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:11:59,506][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:12:00,164][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:12:00,822][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:12:01,479][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:12:02,137][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:12:02,795][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:12:03,452][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:12:04,109][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:12:04,766][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:12:05,424][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:12:06,083][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:12:06,741][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:12:07,399][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:12:08,056][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:12:08,714][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:12:09,371][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:12:10,029][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:12:10,687][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:12:11,344][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:12:12,001][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:12:12,659][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:12:13,316][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:12:13,975][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:12:14,631][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:12:15,288][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:12:16,180][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:12:16,838][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:12:17,495][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:12:18,152][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:12:18,810][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:12:19,467][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:12:20,124][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:12:20,782][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:12:21,439][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:12:22,097][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:12:22,754][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:12:23,411][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:12:24,069][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:12:24,727][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:12:25,384][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:12:26,042][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:12:26,699][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:12:27,396][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:12:28,724][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:12:28,727][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:12:28,728][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:12:30,105][__main__][INFO] - Iteration 728 took 51s (9.13% Gen, 88.20% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 13m 52s. Estimated total time: 14h 20m 17s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 1s, 500 more iterations: 7h 10m 8s. [2026-03-26 01:12:30,107][__main__][INFO] - Starting iteration 728. [2026-03-26 01:12:30,111][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:12:30,112][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:12:35,016][__main__][INFO] - Number of regex retries in iteration 728: 0 [2026-03-26 01:12:35,018][__main__][INFO] - agents played in iteration 728 are Bob, Alice [2026-03-26 01:12:35,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:12:35,682][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:12:35,682][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:12:35,683][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:12:36,386][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:12:36,992][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:12:37,651][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:12:38,307][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:12:38,965][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:12:39,621][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:12:40,278][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:12:40,935][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:12:41,593][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:12:42,251][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:12:42,908][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:12:43,565][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:12:44,222][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:12:44,880][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:12:45,538][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:12:46,195][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:12:46,852][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:12:47,509][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:12:48,167][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:12:48,824][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:12:49,482][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:12:50,139][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:12:50,796][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:12:51,454][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:12:52,111][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:12:52,768][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:12:53,426][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:12:54,083][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:12:54,740][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:12:55,397][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:12:56,054][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:12:56,712][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:12:57,369][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:12:58,026][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:12:58,684][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:12:59,342][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:12:59,999][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:13:00,656][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:13:01,314][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:13:01,971][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:13:02,628][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:13:03,286][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:13:03,943][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:13:04,600][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:13:05,258][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:13:05,915][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:13:06,572][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:13:07,230][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:13:08,173][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:13:08,831][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:13:09,488][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:13:10,145][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:13:10,803][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:13:11,460][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:13:12,117][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:13:12,774][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:13:13,432][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:13:14,089][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:13:14,747][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:13:15,404][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:13:16,062][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:13:16,719][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:13:17,376][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:13:18,034][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:13:18,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:13:19,407][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:13:20,762][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:13:20,765][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:13:20,766][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:13:22,359][__main__][INFO] - Iteration 729 took 52s (9.39% Gen, 87.56% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 23m 33s. Estimated total time: 14h 30m 50s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 5s, 500 more iterations: 7h 15m 25s. [2026-03-26 01:13:22,362][__main__][INFO] - Starting iteration 729. [2026-03-26 01:13:22,365][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:13:22,366][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:13:28,161][__main__][INFO] - Number of regex retries in iteration 729: 0 [2026-03-26 01:13:28,162][__main__][INFO] - agents played in iteration 729 are Bob, Alice [2026-03-26 01:13:29,041][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:13:29,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:13:29,108][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:13:29,108][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:13:29,792][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:13:30,409][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:13:31,069][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:13:31,727][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:13:32,385][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:13:33,043][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:13:33,701][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:13:34,359][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:13:35,017][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:13:35,676][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:13:36,333][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:13:36,992][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:13:37,649][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:13:38,307][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:13:38,966][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:13:39,624][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:13:40,282][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:13:40,940][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:13:41,598][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:13:42,256][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:13:42,914][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:13:43,573][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:13:44,231][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:13:44,888][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:13:45,547][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:13:46,205][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:13:46,863][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:13:47,521][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:13:48,179][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:13:48,838][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:13:49,496][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:13:50,154][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:13:50,812][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:13:51,470][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:13:52,128][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:13:52,788][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:13:53,445][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:13:54,103][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:13:54,761][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:13:55,420][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:13:56,078][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:13:56,736][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:13:57,395][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:13:58,053][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:13:58,711][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:13:59,369][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:14:00,028][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:14:00,686][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:14:01,578][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:14:02,236][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:14:02,893][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:14:03,550][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:14:04,208][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:14:04,865][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:14:05,523][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:14:06,180][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:14:06,838][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:14:07,496][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:14:08,153][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:14:08,811][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:14:09,469][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:14:10,126][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:14:10,784][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:14:11,442][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:14:12,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:14:12,819][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:14:14,257][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:14:14,261][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:14:14,296][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:14:15,938][__main__][INFO] - Iteration 730 took 53s (10.82% Gen, 86.11% Train). Generation: 5s, Training: 46s. Estimated remaining time: 3h 44m 44s. Estimated total time: 14h 52m 54s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 17s, 500 more iterations: 7h 26m 27s. [2026-03-26 01:14:15,941][__main__][INFO] - Starting iteration 730. [2026-03-26 01:14:15,946][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:14:15,946][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:14:22,098][__main__][INFO] - Number of regex retries in iteration 730: 0 [2026-03-26 01:14:22,100][__main__][INFO] - agents played in iteration 730 are Bob, Alice [2026-03-26 01:14:22,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:14:22,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:14:22,767][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:14:22,768][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:14:23,431][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:14:24,041][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:14:24,698][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:14:25,355][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:14:26,012][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:14:26,669][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:14:27,326][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:14:27,983][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:14:28,640][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:14:29,298][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:14:29,954][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:14:30,612][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:14:31,269][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:14:31,926][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:14:32,583][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:14:33,240][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:14:33,897][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:14:34,554][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:14:35,212][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:14:35,869][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:14:36,527][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:14:37,184][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:14:37,841][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:14:38,503][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:14:39,162][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:14:39,821][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:14:40,479][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:14:41,138][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:14:41,797][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:14:42,456][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:14:43,114][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:14:43,773][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:14:44,432][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:14:45,091][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:14:45,750][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:14:46,408][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:14:47,066][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:14:47,724][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:14:48,384][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:14:49,042][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:14:49,701][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:14:50,360][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:14:51,019][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:14:51,677][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:14:52,336][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:14:52,995][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:14:53,654][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:14:54,313][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:14:55,249][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:14:55,909][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:14:56,568][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:14:57,226][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:14:57,885][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:14:58,545][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:14:59,206][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:14:59,865][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:15:00,524][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:15:01,183][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:15:01,843][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:15:02,502][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:15:03,161][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:15:03,819][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:15:04,478][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:15:05,136][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:15:05,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:15:06,502][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:15:07,844][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:15:07,847][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:15:07,848][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:15:09,312][__main__][INFO] - Iteration 731 took 53s (11.53% Gen, 85.72% Train). Generation: 6s, Training: 45s. Estimated remaining time: 3h 40m 25s. Estimated total time: 14h 49m 29s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 56s, 500 more iterations: 7h 24m 44s. [2026-03-26 01:15:09,315][__main__][INFO] - Starting iteration 731. [2026-03-26 01:15:09,319][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:15:09,320][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:15:14,515][__main__][INFO] - Number of regex retries in iteration 731: 0 [2026-03-26 01:15:14,516][__main__][INFO] - agents played in iteration 731 are Bob, Alice [2026-03-26 01:15:15,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:15:15,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:15:15,070][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:15:15,071][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:15:15,742][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:15:16,347][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:15:17,005][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:15:17,662][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:15:18,319][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:15:18,976][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:15:19,634][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:15:20,290][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:15:20,947][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:15:21,604][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:15:22,262][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:15:22,919][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:15:23,576][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:15:24,233][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:15:24,890][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:15:25,547][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:15:26,205][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:15:26,862][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:15:27,520][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:15:28,177][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:15:28,834][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:15:29,492][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:15:30,156][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:15:30,814][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:15:31,471][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:15:32,128][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:15:32,785][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:15:33,442][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:15:34,099][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:15:34,757][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:15:35,414][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:15:36,071][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:15:36,729][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:15:37,387][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:15:38,046][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:15:38,703][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:15:39,362][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:15:40,019][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:15:40,677][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:15:41,335][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:15:41,993][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:15:42,650][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:15:43,308][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:15:43,966][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:15:44,623][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:15:45,281][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:15:45,940][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:15:46,600][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:15:47,540][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:15:48,199][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:15:48,857][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:15:49,514][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:15:50,172][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:15:50,829][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:15:51,486][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:15:52,144][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:15:52,802][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:15:53,459][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:15:54,117][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:15:54,774][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:15:55,431][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:15:56,089][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:15:56,746][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:15:57,404][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:15:58,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:15:58,757][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:16:00,180][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:16:00,184][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:16:00,185][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:16:05,521][__main__][INFO] - Iteration 732 took 56s (9.25% Gen, 81.25% Train). Generation: 5s, Training: 45s. Estimated remaining time: 4h 26m 43s. Estimated total time: 15h 36m 44s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 40s, 500 more iterations: 7h 48m 22s. [2026-03-26 01:16:05,523][__main__][INFO] - Starting iteration 732. [2026-03-26 01:16:05,527][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:16:05,528][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:16:10,403][__main__][INFO] - Number of regex retries in iteration 732: 0 [2026-03-26 01:16:10,404][__main__][INFO] - agents played in iteration 732 are Bob, Alice [2026-03-26 01:16:10,947][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:16:11,009][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:16:11,010][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:16:11,010][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:16:11,672][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:16:12,277][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:16:12,935][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:16:13,592][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:16:14,249][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:16:14,905][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:16:15,562][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:16:16,219][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:16:16,875][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:16:17,532][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:16:18,189][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:16:18,846][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:16:19,503][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:16:20,160][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:16:20,818][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:16:21,475][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:16:22,132][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:16:22,790][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:16:23,448][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:16:24,105][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:16:24,763][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:16:25,420][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:16:26,078][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:16:26,736][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:16:27,394][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:16:28,051][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:16:28,710][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:16:29,368][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:16:30,025][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:16:30,683][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:16:31,340][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:16:31,998][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:16:32,655][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:16:33,313][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:16:33,970][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:16:34,627][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:16:35,285][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:16:35,942][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:16:36,599][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:16:37,258][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:16:37,915][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:16:38,572][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:16:39,230][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:16:39,887][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:16:40,544][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:16:41,202][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:16:41,859][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:16:42,516][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:16:43,412][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:16:44,070][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:16:44,728][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:16:45,385][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:16:46,042][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:16:46,699][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:16:47,357][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:16:48,014][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:16:48,672][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:16:49,329][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:16:49,986][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:16:50,644][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:16:51,301][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:16:51,958][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:16:52,616][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:16:53,273][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:16:53,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:16:54,672][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:16:56,024][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:16:56,026][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:16:56,027][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:16:57,413][__main__][INFO] - Iteration 733 took 51s (9.40% Gen, 87.92% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 13m 56s. Estimated total time: 14h 24m 48s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 28s, 500 more iterations: 7h 12m 24s. [2026-03-26 01:16:57,416][__main__][INFO] - Starting iteration 733. [2026-03-26 01:16:57,420][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:16:57,421][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:17:02,087][__main__][INFO] - Number of regex retries in iteration 733: 0 [2026-03-26 01:17:02,087][__main__][INFO] - agents played in iteration 733 are Bob, Alice [2026-03-26 01:17:02,682][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:17:02,744][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:17:02,744][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:17:02,745][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:17:03,442][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:17:04,056][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:17:04,714][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:17:05,371][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:17:06,029][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:17:06,686][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:17:07,344][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:17:08,002][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:17:08,659][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:17:09,316][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:17:09,974][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:17:10,632][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:17:11,289][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:17:11,946][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:17:12,604][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:17:13,262][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:17:13,919][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:17:14,577][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:17:15,234][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:17:15,892][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:17:16,550][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:17:17,207][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:17:17,864][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:17:18,522][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:17:19,179][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:17:19,836][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:17:20,494][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:17:21,151][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:17:21,810][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:17:22,468][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:17:23,125][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:17:23,783][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:17:24,440][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:17:25,098][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:17:25,755][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:17:26,413][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:17:27,071][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:17:27,728][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:17:28,386][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:17:29,044][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:17:29,701][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:17:30,359][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:17:31,017][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:17:31,674][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:17:32,332][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:17:32,989][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:17:33,648][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:17:34,305][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:17:35,258][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:17:35,916][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:17:36,574][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:17:37,231][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:17:37,888][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:17:38,546][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:17:39,204][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:17:39,862][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:17:40,519][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:17:41,176][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:17:41,834][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:17:42,491][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:17:43,149][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:17:43,806][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:17:44,464][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:17:45,121][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:17:45,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:17:46,579][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:17:47,909][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:17:47,911][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:17:47,913][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:17:49,312][__main__][INFO] - Iteration 734 took 51s (8.99% Gen, 88.31% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 13m 9s. Estimated total time: 14h 24m 53s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 29s, 500 more iterations: 7h 12m 26s. [2026-03-26 01:17:49,314][__main__][INFO] - Starting iteration 734. [2026-03-26 01:17:49,318][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:17:49,319][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:17:54,686][__main__][INFO] - Number of regex retries in iteration 734: 0 [2026-03-26 01:17:54,688][__main__][INFO] - agents played in iteration 734 are Bob, Alice [2026-03-26 01:17:55,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:17:55,655][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:17:55,656][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:17:55,656][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:17:56,322][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:17:56,926][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:17:57,584][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:17:58,242][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:17:58,900][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:17:59,557][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:18:00,215][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:18:00,873][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:18:01,531][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:18:02,188][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:18:02,847][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:18:03,504][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:18:04,162][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:18:04,819][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:18:05,477][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:18:06,134][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:18:06,792][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:18:07,450][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:18:08,107][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:18:08,765][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:18:09,422][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:18:10,080][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:18:10,737][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:18:11,395][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:18:12,052][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:18:12,710][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:18:13,367][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:18:14,025][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:18:14,682][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:18:15,339][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:18:15,996][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:18:16,653][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:18:17,311][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:18:17,969][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:18:18,626][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:18:19,284][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:18:19,941][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:18:20,599][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:18:21,257][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:18:21,914][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:18:22,572][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:18:23,229][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:18:23,887][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:18:24,544][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:18:25,202][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:18:25,859][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:18:26,517][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:18:27,174][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:18:28,076][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:18:28,734][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:18:29,391][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:18:30,050][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:18:30,707][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:18:31,365][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:18:32,022][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:18:32,680][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:18:33,338][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:18:33,995][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:18:34,653][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:18:35,311][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:18:35,968][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:18:36,626][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:18:37,284][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:18:37,942][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:18:38,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:18:39,304][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:18:40,668][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:18:40,670][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:18:40,672][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:18:42,280][__main__][INFO] - Iteration 735 took 52s (10.14% Gen, 86.82% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 30m 7s. Estimated total time: 14h 42m 44s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 16s, 500 more iterations: 7h 21m 22s. [2026-03-26 01:18:42,282][__main__][INFO] - Starting iteration 735. [2026-03-26 01:18:42,287][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:18:42,288][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:18:48,322][__main__][INFO] - Number of regex retries in iteration 735: 0 [2026-03-26 01:18:48,323][__main__][INFO] - agents played in iteration 735 are Bob, Alice [2026-03-26 01:18:49,449][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:18:49,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:18:49,511][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:18:49,512][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:18:50,185][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:18:50,795][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:18:51,454][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:18:52,111][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:18:52,768][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:18:53,427][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:18:54,084][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:18:54,741][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:18:55,399][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:18:56,056][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:18:56,713][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:18:57,370][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:18:58,027][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:18:58,685][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:18:59,342][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:19:00,000][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:19:00,657][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:19:01,314][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:19:01,972][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:19:02,629][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:19:03,286][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:19:03,943][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:19:04,600][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:19:05,258][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:19:05,916][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:19:06,573][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:19:07,231][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:19:07,888][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:19:08,545][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:19:09,203][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:19:09,860][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:19:10,518][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:19:11,176][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:19:11,833][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:19:12,490][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:19:13,147][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:19:13,805][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:19:14,463][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:19:15,120][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:19:15,777][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:19:16,434][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:19:17,092][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:19:17,749][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:19:18,406][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:19:19,064][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:19:19,721][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:19:20,378][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:19:21,036][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:19:21,994][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:19:22,652][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:19:23,844][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:19:24,502][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:19:25,159][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:19:25,817][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:19:26,475][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:19:27,133][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:19:27,791][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:19:28,449][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:19:29,107][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:19:29,764][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:19:30,422][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:19:31,081][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:19:31,738][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:19:32,396][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:19:33,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:19:33,779][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:19:35,159][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:19:35,163][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:19:35,164][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:19:40,352][__main__][INFO] - Iteration 736 took 58s (10.39% Gen, 80.67% Train). Generation: 6s, Training: 46s. Estimated remaining time: 4h 54m 12s. Estimated total time: 16h 7m 47s. Time estimates for 10 more iterations: 9m 40s, 100 more iterations: 1h 36m 46s, 500 more iterations: 8h 3m 53s. [2026-03-26 01:19:40,357][__main__][INFO] - Starting iteration 736. [2026-03-26 01:19:40,362][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:19:40,363][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:19:45,249][__main__][INFO] - Number of regex retries in iteration 736: 0 [2026-03-26 01:19:45,250][__main__][INFO] - agents played in iteration 736 are Bob, Alice [2026-03-26 01:19:45,748][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:19:45,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:19:45,810][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:19:45,811][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:19:46,466][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:19:47,082][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:19:47,744][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:19:48,402][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:19:49,062][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:19:49,721][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:19:50,380][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:19:51,039][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:19:51,698][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:19:52,357][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:19:53,016][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:19:53,675][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:19:54,334][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:19:54,994][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:19:55,653][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:19:56,312][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:19:56,971][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:19:57,631][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:19:58,291][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:19:58,954][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:19:59,614][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:20:00,273][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:20:00,933][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:20:01,592][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:20:02,251][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:20:02,911][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:20:03,571][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:20:04,230][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:20:04,889][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:20:05,548][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:20:06,207][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:20:06,867][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:20:07,526][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:20:08,185][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:20:08,846][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:20:09,505][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:20:10,165][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:20:10,824][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:20:11,484][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:20:12,144][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:20:12,803][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:20:13,463][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:20:14,123][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:20:14,783][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:20:15,443][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:20:16,102][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:20:16,761][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:20:17,421][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:20:18,331][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:20:18,992][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:20:19,651][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:20:20,310][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:20:20,969][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:20:21,628][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:20:22,287][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:20:22,946][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:20:23,604][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:20:24,263][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:20:24,922][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:20:25,580][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:20:26,239][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:20:26,898][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:20:27,556][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:20:28,217][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:20:28,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:20:29,609][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:20:30,931][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:20:30,936][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:20:30,937][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:20:32,560][__main__][INFO] - Iteration 737 took 52s (9.36% Gen, 87.52% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 15m 33s. Estimated total time: 14h 30m 0s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 0s, 500 more iterations: 7h 15m 0s. [2026-03-26 01:20:32,562][__main__][INFO] - Starting iteration 737. [2026-03-26 01:20:32,566][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:20:32,567][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:20:37,672][__main__][INFO] - Number of regex retries in iteration 737: 0 [2026-03-26 01:20:37,673][__main__][INFO] - agents played in iteration 737 are Bob, Alice [2026-03-26 01:20:38,542][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:20:38,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:20:38,605][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:20:38,606][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:20:39,357][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:20:40,002][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:20:40,659][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:20:41,316][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:20:41,973][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:20:42,630][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:20:43,287][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:20:43,944][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:20:44,601][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:20:45,258][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:20:45,915][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:20:46,572][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:20:47,230][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:20:47,887][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:20:48,544][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:20:49,202][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:20:49,859][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:20:50,516][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:20:51,174][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:20:51,831][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:20:52,488][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:20:53,146][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:20:53,803][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:20:54,460][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:20:55,118][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:20:55,775][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:20:56,434][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:20:57,092][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:20:57,751][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:20:58,408][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:20:59,066][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:20:59,725][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:21:00,383][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:21:01,041][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:21:01,699][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:21:02,358][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:21:03,015][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:21:03,674][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:21:04,332][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:21:04,989][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:21:05,647][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:21:06,306][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:21:06,963][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:21:07,621][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:21:08,279][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:21:08,936][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:21:09,593][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:21:10,251][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:21:11,241][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:21:11,900][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:21:12,559][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:21:13,216][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:21:13,874][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:21:14,531][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:21:15,189][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:21:15,847][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:21:16,505][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:21:17,162][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:21:17,821][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:21:18,478][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:21:19,135][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:21:19,793][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:21:20,450][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:21:21,107][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:21:21,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:21:22,482][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:21:23,805][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:21:23,808][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:21:23,810][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:21:25,281][__main__][INFO] - Iteration 738 took 52s (9.69% Gen, 87.52% Train). Generation: 5s, Training: 46s. Estimated remaining time: 3h 23m 17s. Estimated total time: 14h 38m 37s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 51s, 500 more iterations: 7h 19m 18s. [2026-03-26 01:21:25,284][__main__][INFO] - Starting iteration 738. [2026-03-26 01:21:25,289][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:21:25,290][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:21:30,593][__main__][INFO] - Number of regex retries in iteration 738: 0 [2026-03-26 01:21:30,595][__main__][INFO] - agents played in iteration 738 are Bob, Alice [2026-03-26 01:21:31,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:21:31,272][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:21:31,273][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:21:31,274][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:21:31,939][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:21:32,605][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:21:33,264][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:21:33,920][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:21:34,577][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:21:35,234][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:21:35,891][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:21:36,548][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:21:37,205][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:21:37,862][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:21:38,519][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:21:39,176][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:21:39,835][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:21:40,492][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:21:41,149][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:21:41,807][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:21:42,465][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:21:43,122][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:21:43,780][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:21:44,438][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:21:45,095][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:21:45,752][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:21:46,411][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:21:47,068][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:21:47,726][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:21:48,384][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:21:49,042][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:21:49,699][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:21:50,358][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:21:51,015][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:21:51,673][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:21:52,331][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:21:52,988][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:21:53,645][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:21:54,303][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:21:54,960][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:21:55,617][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:21:56,274][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:21:56,932][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:21:57,589][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:21:58,247][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:21:58,905][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:21:59,562][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:22:00,219][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:22:00,877][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:22:01,534][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:22:02,191][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:22:02,848][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:22:03,772][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:22:04,429][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:22:05,087][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:22:05,745][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:22:06,402][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:22:07,060][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:22:07,717][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:22:08,374][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:22:09,031][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:22:09,690][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:22:10,347][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:22:11,005][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:22:11,662][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:22:12,319][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:22:12,977][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:22:13,635][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:22:14,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:22:15,109][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:22:16,460][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:22:16,463][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:22:16,465][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:22:17,871][__main__][INFO] - Iteration 739 took 52s (10.09% Gen, 87.23% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 20m 12s. Estimated total time: 14h 36m 25s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 38s, 500 more iterations: 7h 18m 12s. [2026-03-26 01:22:17,874][__main__][INFO] - Starting iteration 739. [2026-03-26 01:22:17,878][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:22:17,878][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:22:22,809][__main__][INFO] - Number of regex retries in iteration 739: 0 [2026-03-26 01:22:22,810][__main__][INFO] - agents played in iteration 739 are Bob, Alice [2026-03-26 01:22:23,496][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:22:23,558][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:22:23,559][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:22:23,560][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:22:24,252][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:22:24,863][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:22:25,522][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:22:26,178][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:22:26,836][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:22:27,494][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:22:28,152][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:22:28,809][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:22:29,467][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:22:30,124][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:22:30,782][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:22:31,439][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:22:32,097][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:22:32,755][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:22:33,412][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:22:34,070][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:22:34,729][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:22:35,386][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:22:36,044][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:22:36,702][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:22:37,359][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:22:38,016][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:22:38,674][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:22:39,331][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:22:39,988][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:22:40,646][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:22:41,303][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:22:41,960][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:22:42,618][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:22:43,275][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:22:43,932][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:22:44,590][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:22:45,247][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:22:45,904][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:22:46,562][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:22:47,219][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:22:47,876][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:22:48,533][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:22:49,190][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:22:49,848][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:22:50,505][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:22:51,162][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:22:51,820][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:22:52,477][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:22:53,135][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:22:53,792][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:22:54,449][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:22:55,107][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:22:56,007][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:22:56,664][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:22:57,322][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:22:57,980][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:22:58,638][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:22:59,296][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:22:59,953][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:23:00,610][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:23:01,268][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:23:01,926][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:23:02,583][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:23:03,241][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:23:03,898][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:23:04,555][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:23:05,213][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:23:05,870][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:23:06,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:23:07,219][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:23:08,595][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:23:08,598][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:23:08,599][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:23:09,993][__main__][INFO] - Iteration 740 took 52s (9.46% Gen, 87.86% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 11m 32s. Estimated total time: 14h 28m 37s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 51s, 500 more iterations: 7h 14m 18s. [2026-03-26 01:23:09,997][__main__][INFO] - Starting iteration 740. [2026-03-26 01:23:10,002][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:23:10,003][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:23:15,993][__main__][INFO] - Number of regex retries in iteration 740: 0 [2026-03-26 01:23:15,995][__main__][INFO] - agents played in iteration 740 are Bob, Alice [2026-03-26 01:23:17,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:23:17,272][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:23:17,273][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:23:17,274][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:23:17,985][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:23:18,746][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:23:19,404][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:23:21,459][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:23:22,117][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:23:22,774][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:23:23,431][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:23:24,088][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:23:24,745][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:23:25,402][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:23:26,059][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:23:26,716][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:23:27,374][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:23:28,031][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:23:28,688][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:23:29,346][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:23:30,003][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:23:30,659][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:23:31,317][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:23:31,974][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:23:32,631][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:23:33,288][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:23:33,945][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:23:34,602][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:23:35,261][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:23:35,918][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:23:36,577][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:23:37,235][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:23:37,892][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:23:38,549][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:23:39,206][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:23:39,863][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:23:40,521][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:23:41,178][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:23:41,836][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:23:42,493][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:23:43,150][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:23:43,807][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:23:44,465][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:23:45,123][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:23:45,780][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:23:46,437][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:23:47,097][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:23:47,755][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:23:48,412][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:23:49,069][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:23:49,727][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:23:50,384][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:23:51,377][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:23:52,035][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:23:52,692][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:23:53,350][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:23:54,007][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:23:54,663][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:23:55,321][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:23:55,979][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:23:56,638][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:23:57,295][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:23:57,952][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:23:58,610][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:23:59,268][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:23:59,926][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:24:00,584][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:24:01,242][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:24:01,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:24:02,717][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-26 01:24:04,042][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:24:04,045][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:24:04,046][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:24:05,489][__main__][INFO] - Iteration 741 took 55s (10.80% Gen, 86.60% Train). Generation: 5s, Training: 48s. Estimated remaining time: 4h 6m 49s. Estimated total time: 15h 24m 49s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 28s, 500 more iterations: 7h 42m 24s. [2026-03-26 01:24:05,492][__main__][INFO] - Starting iteration 741. [2026-03-26 01:24:05,497][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:24:05,498][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:24:12,245][__main__][INFO] - Number of regex retries in iteration 741: 0 [2026-03-26 01:24:12,246][__main__][INFO] - agents played in iteration 741 are Bob, Alice [2026-03-26 01:24:13,513][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:24:13,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:24:13,576][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:24:13,576][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:24:14,245][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:24:14,866][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:24:15,520][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:24:16,177][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:24:16,835][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:24:17,493][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:24:18,150][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:24:18,808][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:24:19,466][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:24:20,123][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:24:20,781][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:24:21,439][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:24:22,097][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:24:22,755][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:24:23,413][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:24:24,072][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:24:24,730][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:24:25,389][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:24:26,046][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:24:26,705][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:24:27,363][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:24:28,021][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:24:28,680][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:24:29,339][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:24:29,996][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:24:30,654][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:24:31,312][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:24:31,971][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:24:32,629][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:24:33,289][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:24:33,947][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:24:34,606][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:24:35,264][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:24:35,923][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:24:36,581][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:24:37,240][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:24:37,898][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:24:38,556][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:24:39,214][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:24:39,873][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:24:40,531][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:24:41,189][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:24:41,847][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:24:42,506][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:24:43,165][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:24:43,823][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:24:44,481][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:24:45,140][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:24:46,134][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:24:46,794][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:24:47,451][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:24:48,109][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:24:48,767][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:24:49,424][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:24:50,082][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:24:50,741][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:24:51,397][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:24:52,055][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:24:52,713][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:24:53,371][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:24:54,028][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:24:54,686][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:24:55,343][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:24:56,001][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:24:56,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:24:57,379][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:24:58,721][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:24:58,723][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:24:58,725][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:25:00,178][__main__][INFO] - Iteration 742 took 54s (12.34% Gen, 85.00% Train). Generation: 6s, Training: 46s. Estimated remaining time: 3h 52m 28s. Estimated total time: 15h 11m 22s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 8s, 500 more iterations: 7h 35m 41s. [2026-03-26 01:25:00,181][__main__][INFO] - Starting iteration 742. [2026-03-26 01:25:00,185][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:25:00,186][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:25:05,830][__main__][INFO] - Number of regex retries in iteration 742: 0 [2026-03-26 01:25:05,831][__main__][INFO] - agents played in iteration 742 are Bob, Alice [2026-03-26 01:25:06,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:25:06,491][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:25:06,491][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:25:06,492][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:25:07,193][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:25:07,811][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:25:08,470][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:25:09,126][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:25:09,783][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:25:10,440][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:25:11,097][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:25:11,754][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:25:12,411][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:25:13,069][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:25:13,726][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:25:14,383][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:25:15,040][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:25:15,697][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:25:16,353][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:25:17,010][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:25:17,667][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:25:18,324][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:25:18,983][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:25:19,639][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:25:20,296][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:25:20,953][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:25:21,611][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:25:22,269][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:25:22,926][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:25:23,584][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:25:24,241][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:25:24,898][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:25:25,555][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:25:26,213][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:25:26,870][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:25:27,528][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:25:28,185][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:25:28,842][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:25:29,500][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:25:30,157][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:25:30,814][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:25:31,472][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:25:32,129][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:25:32,787][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:25:33,445][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:25:34,102][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:25:34,760][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:25:35,418][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:25:36,076][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:25:36,733][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:25:37,391][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:25:38,049][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:25:39,018][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:25:39,675][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:25:40,332][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:25:40,990][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:25:41,648][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:25:42,306][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:25:42,964][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:25:43,621][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:25:44,278][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:25:44,936][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:25:45,593][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:25:46,251][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:25:46,909][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:25:47,566][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:25:48,223][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:25:48,881][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:25:49,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:25:50,266][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:25:51,597][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:25:51,600][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:25:51,601][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:25:53,010][__main__][INFO] - Iteration 743 took 52s (10.69% Gen, 86.64% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 20m 38s. Estimated total time: 14h 40m 26s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 2s, 500 more iterations: 7h 20m 13s. [2026-03-26 01:25:53,013][__main__][INFO] - Starting iteration 743. [2026-03-26 01:25:53,017][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:25:53,018][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:25:58,084][__main__][INFO] - Number of regex retries in iteration 743: 0 [2026-03-26 01:25:58,085][__main__][INFO] - agents played in iteration 743 are Bob, Alice [2026-03-26 01:25:58,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:25:58,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:25:58,754][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:25:58,754][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:25:59,447][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:26:00,062][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:26:00,721][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:26:01,378][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:26:02,035][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:26:02,692][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:26:03,349][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:26:04,006][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:26:04,663][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:26:05,320][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:26:05,977][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:26:06,634][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:26:07,291][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:26:07,948][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:26:08,605][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:26:09,262][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:26:09,921][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:26:10,578][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:26:11,236][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:26:11,893][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:26:12,550][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:26:13,208][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:26:13,865][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:26:14,522][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:26:15,179][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:26:15,836][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:26:16,494][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:26:17,153][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:26:17,810][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:26:18,468][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:26:19,126][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:26:19,784][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:26:20,441][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:26:21,099][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:26:21,758][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:26:22,416][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:26:23,073][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:26:23,731][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:26:24,389][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:26:25,047][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:26:25,705][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:26:26,363][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:26:27,020][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:26:27,678][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:26:28,336][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:26:28,994][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:26:29,651][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:26:30,309][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:26:31,286][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:26:31,944][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:26:32,601][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:26:33,263][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:26:33,922][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:26:34,581][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:26:35,239][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:26:35,897][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:26:36,556][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:26:37,215][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:26:37,874][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:26:38,532][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:26:39,190][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:26:39,848][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:26:40,507][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:26:41,165][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:26:41,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:26:42,561][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:26:44,023][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:26:44,027][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:26:44,048][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:26:45,730][__main__][INFO] - Iteration 744 took 52s (9.61% Gen, 87.19% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 17m 54s. Estimated total time: 14h 38m 34s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 51s, 500 more iterations: 7h 19m 17s. [2026-03-26 01:26:45,736][__main__][INFO] - Starting iteration 744. [2026-03-26 01:26:45,761][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:26:45,762][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:26:51,476][__main__][INFO] - Number of regex retries in iteration 744: 0 [2026-03-26 01:26:51,477][__main__][INFO] - agents played in iteration 744 are Bob, Alice [2026-03-26 01:26:52,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:26:52,206][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:26:52,206][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:26:52,207][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:26:52,921][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:26:53,573][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:26:54,231][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:26:54,888][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:26:55,545][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:26:56,201][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:26:56,858][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:26:57,515][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:26:58,173][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:26:58,830][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:26:59,487][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:27:00,144][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:27:00,802][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:27:01,459][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:27:02,115][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:27:02,773][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:27:03,435][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:27:04,094][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:27:04,752][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:27:05,410][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:27:06,069][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:27:06,727][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:27:07,385][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:27:08,044][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:27:08,702][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:27:09,361][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:27:10,019][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:27:10,679][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:27:11,337][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:27:11,996][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:27:12,655][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:27:13,314][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:27:13,972][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:27:14,630][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:27:15,288][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:27:15,947][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:27:16,605][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:27:17,263][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:27:17,922][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:27:18,580][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:27:19,239][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:27:19,897][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:27:20,555][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:27:21,214][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:27:21,872][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:27:22,530][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:27:23,189][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:27:23,848][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:27:24,757][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:27:25,418][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:27:26,076][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:27:26,739][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:27:27,394][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:27:28,052][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:27:28,711][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:27:29,369][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:27:30,028][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:27:30,687][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:27:31,345][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:27:32,004][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:27:32,662][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:27:33,320][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:27:33,979][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:27:34,638][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:27:35,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:27:36,033][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:27:37,403][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:27:37,405][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:27:37,407][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:27:38,818][__main__][INFO] - Iteration 745 took 53s (10.77% Gen, 86.56% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 22m 46s. Estimated total time: 14h 44m 19s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 25s, 500 more iterations: 7h 22m 9s. [2026-03-26 01:27:38,820][__main__][INFO] - Starting iteration 745. [2026-03-26 01:27:38,824][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:27:38,825][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:27:45,345][__main__][INFO] - Number of regex retries in iteration 745: 0 [2026-03-26 01:27:45,346][__main__][INFO] - agents played in iteration 745 are Bob, Alice [2026-03-26 01:27:45,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:27:46,007][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:27:46,008][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:27:46,008][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:27:46,686][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:27:47,295][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:27:47,954][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:27:48,611][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:27:49,268][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:27:49,925][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:27:50,581][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:27:51,239][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:27:51,898][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:27:52,555][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:27:53,212][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:27:53,869][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:27:54,526][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:27:55,183][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:27:55,841][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:27:56,498][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:27:57,156][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:27:57,814][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:27:58,472][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:27:59,129][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:27:59,786][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:28:00,443][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:28:01,100][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:28:01,758][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:28:02,415][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:28:03,072][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:28:03,729][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:28:04,386][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:28:05,044][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:28:05,701][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:28:06,358][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:28:07,015][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:28:07,672][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:28:08,329][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:28:08,986][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:28:09,644][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:28:10,301][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:28:10,959][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:28:11,617][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:28:12,274][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:28:12,931][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:28:13,588][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:28:14,245][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:28:14,903][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:28:15,560][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:28:16,217][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:28:16,875][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:28:17,534][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:28:18,501][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:28:19,161][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:28:19,819][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:28:20,478][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:28:21,136][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:28:21,795][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:28:22,454][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:28:23,113][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:28:23,772][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:28:24,430][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:28:25,089][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:28:25,748][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:28:26,406][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:28:27,065][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:28:27,723][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:28:28,383][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:28:29,042][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:28:29,760][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:28:31,194][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:28:31,197][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:28:31,198][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:28:32,565][__main__][INFO] - Iteration 746 took 53s (12.08% Gen, 85.32% Train). Generation: 6s, Training: 45s. Estimated remaining time: 3h 33m 15s. Estimated total time: 14h 55m 42s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 34s, 500 more iterations: 7h 27m 51s. [2026-03-26 01:28:32,568][__main__][INFO] - Starting iteration 746. [2026-03-26 01:28:32,572][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:28:32,573][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:28:42,466][__main__][INFO] - Number of regex retries in iteration 746: 0 [2026-03-26 01:28:42,467][__main__][INFO] - agents played in iteration 746 are Bob, Alice [2026-03-26 01:28:43,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:28:43,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:28:43,711][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:28:43,712][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:28:44,368][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:28:44,995][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:28:45,654][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:28:46,312][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:28:46,969][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:28:47,627][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:28:48,285][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:28:48,943][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:28:49,601][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:28:50,260][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:28:50,918][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:28:51,576][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:28:52,234][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:28:52,892][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:28:53,551][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:28:54,209][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:28:54,867][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:28:55,526][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:28:56,184][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:28:56,842][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:28:57,501][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:28:58,159][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:28:58,819][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:28:59,477][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:29:00,135][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:29:00,794][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:29:01,453][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:29:02,111][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:29:02,769][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:29:03,428][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:29:04,086][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:29:04,745][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:29:05,404][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:29:06,063][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:29:06,722][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:29:07,380][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:29:08,039][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:29:08,698][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:29:09,356][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:29:10,015][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:29:10,674][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:29:11,333][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:29:11,991][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:29:12,650][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:29:13,309][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:29:13,967][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:29:14,626][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:29:15,285][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:29:16,185][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:29:16,845][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:29:17,504][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:29:18,163][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:29:18,822][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:29:19,914][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:29:20,572][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:29:21,231][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:29:21,890][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:29:22,549][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:29:23,207][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:29:23,866][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:29:24,525][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:29:25,185][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:29:25,843][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:29:26,501][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:29:27,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:29:27,970][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:29:29,302][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:29:29,305][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:29:29,306][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:29:30,931][__main__][INFO] - Iteration 747 took 58s (16.95% Gen, 80.26% Train). Generation: 9s, Training: 46s. Estimated remaining time: 4h 49m 14s. Estimated total time: 16h 12m 40s. Time estimates for 10 more iterations: 9m 43s, 100 more iterations: 1h 37m 16s, 500 more iterations: 8h 6m 20s. [2026-03-26 01:29:30,933][__main__][INFO] - Starting iteration 747. [2026-03-26 01:29:30,937][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:29:30,938][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:29:39,861][__main__][INFO] - Number of regex retries in iteration 747: 0 [2026-03-26 01:29:39,862][__main__][INFO] - agents played in iteration 747 are Bob, Alice [2026-03-26 01:29:40,514][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:29:40,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:29:40,576][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:29:40,577][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:29:41,242][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:29:41,857][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:29:42,517][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:29:43,176][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:29:43,834][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:29:44,492][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:29:45,152][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:29:45,810][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:29:46,469][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:29:47,127][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:29:47,785][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:29:48,444][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:29:49,102][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:29:49,760][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:29:50,418][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:29:51,076][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:29:51,734][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:29:52,391][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:29:53,049][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:29:53,706][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:29:54,363][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:29:55,021][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:29:55,678][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:29:56,335][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:29:56,993][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:29:57,651][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:29:58,309][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:29:58,967][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:29:59,624][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:30:00,282][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:30:00,940][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:30:01,597][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:30:02,254][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:30:02,912][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:30:03,569][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:30:04,227][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:30:04,885][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:30:05,542][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:30:06,200][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:30:06,858][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:30:07,516][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:30:08,173][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:30:08,831][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:30:09,488][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:30:10,146][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:30:10,804][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:30:11,462][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:30:12,121][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:30:13,083][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:30:13,742][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:30:14,399][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:30:15,059][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:30:15,716][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:30:16,374][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:30:17,032][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:30:17,690][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:30:18,349][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:30:19,009][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:30:19,666][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:30:20,324][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:30:20,982][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:30:21,640][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:30:22,298][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:30:22,955][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:30:23,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:30:24,327][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:30:25,674][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:30:25,677][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:30:25,678][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:30:27,051][__main__][INFO] - Iteration 748 took 56s (15.90% Gen, 81.64% Train). Generation: 8s, Training: 45s. Estimated remaining time: 4h 10m 54s. Estimated total time: 15h 35m 15s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 31s, 500 more iterations: 7h 47m 37s. [2026-03-26 01:30:27,054][__main__][INFO] - Starting iteration 748. [2026-03-26 01:30:27,058][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:30:27,058][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:30:32,138][__main__][INFO] - Number of regex retries in iteration 748: 0 [2026-03-26 01:30:32,139][__main__][INFO] - agents played in iteration 748 are Bob, Alice [2026-03-26 01:30:32,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:30:32,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:30:32,707][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:30:32,707][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:30:33,388][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:30:34,002][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:30:34,661][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:30:35,318][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:30:35,975][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:30:36,632][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:30:37,289][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:30:37,946][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:30:38,603][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:30:39,260][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:30:39,918][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:30:40,575][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:30:41,232][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:30:41,890][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:30:42,547][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:30:43,204][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:30:43,862][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:30:44,519][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:30:45,176][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:30:45,833][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:30:46,490][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:30:47,147][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:30:47,805][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:30:48,462][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:30:49,119][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:30:49,777][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:30:50,434][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:30:51,091][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:30:51,750][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:30:52,407][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:30:53,065][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:30:53,723][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:30:54,380][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:30:55,037][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:30:55,695][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:30:56,353][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:30:57,011][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:30:57,669][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:30:58,327][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:30:58,985][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:30:59,644][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:31:00,301][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:31:00,959][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:31:01,616][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:31:02,274][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:31:02,932][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:31:03,589][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:31:04,247][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:31:05,139][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:31:05,797][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:31:06,455][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:31:07,112][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:31:07,771][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:31:08,428][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:31:09,085][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:31:09,744][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:31:10,401][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:31:11,059][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:31:11,717][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:31:12,374][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:31:13,032][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:31:13,690][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:31:14,347][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:31:15,005][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:31:15,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:31:16,356][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:31:17,735][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:31:17,738][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:31:17,739][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:31:19,118][__main__][INFO] - Iteration 749 took 52s (9.76% Gen, 87.59% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 2m 28s. Estimated total time: 14h 27m 41s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 46s, 500 more iterations: 7h 13m 50s. [2026-03-26 01:31:19,120][__main__][INFO] - Starting iteration 749. [2026-03-26 01:31:19,127][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:31:19,128][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:31:24,077][__main__][INFO] - Number of regex retries in iteration 749: 0 [2026-03-26 01:31:24,079][__main__][INFO] - agents played in iteration 749 are Bob, Alice [2026-03-26 01:31:24,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:31:24,755][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:31:24,756][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:31:24,757][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:31:25,417][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:31:26,021][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:31:26,679][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:31:27,336][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:31:27,994][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:31:28,651][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:31:29,308][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:31:29,966][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:31:30,623][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:31:31,280][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:31:31,938][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:31:32,595][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:31:33,253][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:31:33,911][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:31:34,568][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:31:35,225][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:31:35,883][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:31:36,541][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:31:37,198][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:31:37,856][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:31:38,513][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:31:39,170][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:31:39,828][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:31:40,486][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:31:41,144][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:31:41,802][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:31:42,460][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:31:43,118][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:31:43,776][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:31:44,435][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:31:45,092][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:31:45,750][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:31:46,408][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:31:47,065][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:31:47,722][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:31:48,380][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:31:49,039][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:31:49,696][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:31:50,354][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:31:51,011][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:31:51,669][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:31:52,326][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:31:52,984][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:31:53,641][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:31:54,299][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:31:54,956][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:31:55,615][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:31:56,273][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:31:57,239][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:31:57,897][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:31:58,555][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:31:59,213][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:31:59,870][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:32:00,527][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:32:01,186][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:32:01,843][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:32:02,500][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:32:03,158][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:32:03,816][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:32:04,473][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:32:05,132][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:32:05,789][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:32:06,446][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:32:07,105][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:32:07,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:32:08,471][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:32:09,802][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:32:09,807][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:32:09,808][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:32:11,118][__main__][INFO] - Iteration 750 took 51s (9.52% Gen, 87.95% Train). Generation: 4s, Training: 45s. Estimated remaining time: 3h 0m 30s. Estimated total time: 14h 26m 35s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 39s, 500 more iterations: 7h 13m 17s. [2026-03-26 01:32:11,122][__main__][INFO] - Starting iteration 750. [2026-03-26 01:32:11,128][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:32:11,128][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:32:16,372][__main__][INFO] - Number of regex retries in iteration 750: 0 [2026-03-26 01:32:16,374][__main__][INFO] - agents played in iteration 750 are Bob, Alice [2026-03-26 01:32:16,919][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:32:16,982][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:32:16,983][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:32:16,984][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:32:17,676][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:32:18,296][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:32:18,955][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:32:19,613][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:32:20,271][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:32:20,929][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:32:21,587][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:32:22,245][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:32:22,904][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:32:23,627][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:32:24,283][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:32:24,941][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:32:25,601][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:32:26,260][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:32:26,918][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:32:27,578][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:32:28,236][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:32:28,897][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:32:29,555][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:32:30,214][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:32:30,873][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:32:31,531][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:32:32,189][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:32:32,848][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:32:33,506][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:32:34,164][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:32:34,823][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:32:35,482][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:32:36,140][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:32:36,799][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:32:37,457][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:32:38,116][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:32:38,774][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:32:39,433][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:32:40,093][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:32:40,751][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:32:41,409][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:32:42,068][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:32:42,726][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:32:43,384][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:32:44,042][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:32:44,701][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:32:45,359][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:32:46,017][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:32:46,675][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:32:47,334][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:32:47,992][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:32:48,650][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:32:49,652][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:32:50,310][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:32:50,967][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:32:51,625][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:32:52,283][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:32:52,940][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:32:53,601][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:32:54,259][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:32:54,916][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:32:55,573][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:32:56,232][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:32:56,890][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:32:57,547][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:32:58,205][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:32:58,863][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:32:59,520][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:33:00,179][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:33:00,958][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:33:02,333][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:33:02,337][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:33:02,338][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:33:04,903][__main__][INFO] - Iteration 751 took 53s (9.75% Gen, 85.47% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 29m 18s. Estimated total time: 14h 56m 17s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 37s, 500 more iterations: 7h 28m 8s. [2026-03-26 01:33:04,906][__main__][INFO] - Starting iteration 751. [2026-03-26 01:33:04,915][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:33:04,916][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:33:09,788][__main__][INFO] - Number of regex retries in iteration 751: 0 [2026-03-26 01:33:09,790][__main__][INFO] - agents played in iteration 751 are Bob, Alice [2026-03-26 01:33:10,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:33:10,463][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:33:10,464][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:33:10,465][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:33:11,124][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:33:11,748][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:33:12,408][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:33:13,067][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:33:13,724][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:33:14,382][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:33:15,040][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:33:15,698][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:33:16,356][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:33:17,014][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:33:17,671][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:33:18,330][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:33:18,988][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:33:19,646][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:33:20,304][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:33:20,963][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:33:21,621][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:33:22,279][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:33:22,938][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:33:23,597][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:33:24,255][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:33:24,913][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:33:25,571][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:33:26,229][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:33:26,887][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:33:27,546][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:33:28,205][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:33:28,863][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:33:29,521][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:33:30,180][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:33:30,838][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:33:31,496][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:33:32,154][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:33:32,812][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:33:33,470][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:33:34,129][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:33:34,787][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:33:35,446][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:33:36,105][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:33:36,763][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:33:37,422][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:33:38,081][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:33:38,739][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:33:39,398][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:33:40,057][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:33:40,715][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:33:41,373][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:33:42,032][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:33:42,983][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:33:43,641][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:33:44,298][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:33:44,955][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:33:45,613][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:33:46,270][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:33:46,928][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:33:47,585][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:33:48,243][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:33:48,900][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:33:49,558][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:33:50,215][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:33:50,873][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:33:51,530][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:33:52,188][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:33:52,846][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:33:53,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:33:54,308][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:33:56,328][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:33:56,331][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:33:56,333][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:33:57,711][__main__][INFO] - Iteration 752 took 52s (9.24% Gen, 88.14% Train). Generation: 4s, Training: 46s. Estimated remaining time: 3h 12m 5s. Estimated total time: 14h 39m 58s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 59s, 500 more iterations: 7h 19m 59s. [2026-03-26 01:33:57,713][__main__][INFO] - Starting iteration 752. [2026-03-26 01:33:57,718][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:33:57,719][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:34:04,098][__main__][INFO] - Number of regex retries in iteration 752: 0 [2026-03-26 01:34:04,099][__main__][INFO] - agents played in iteration 752 are Bob, Alice [2026-03-26 01:34:05,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:34:05,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:34:05,441][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:34:05,442][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:34:06,111][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:34:06,725][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:34:07,383][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:34:08,040][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:34:08,697][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:34:09,354][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:34:10,011][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:34:10,669][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:34:11,326][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:34:11,983][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:34:12,641][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:34:13,298][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:34:13,955][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:34:14,612][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:34:15,269][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:34:15,926][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:34:16,583][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:34:17,240][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:34:17,898][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:34:18,555][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:34:19,212][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:34:19,869][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:34:20,527][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:34:21,184][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:34:21,841][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:34:22,499][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:34:23,156][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:34:23,813][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:34:24,471][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:34:25,128][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:34:25,785][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:34:26,443][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:34:27,100][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:34:27,757][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:34:28,416][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:34:29,073][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:34:29,731][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:34:30,388][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:34:31,047][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:34:31,704][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:34:32,362][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:34:33,019][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:34:33,677][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:34:34,335][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:34:34,993][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:34:35,650][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:34:36,308][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:34:36,966][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:34:37,878][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:34:38,536][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:34:39,193][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:34:39,851][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:34:40,508][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:34:41,166][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:34:41,824][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:34:42,481][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:34:43,140][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:34:43,798][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:34:44,456][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:34:45,113][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:34:45,771][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:34:46,428][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:34:47,086][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:34:47,746][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:34:48,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:34:49,119][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:34:50,446][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:34:50,449][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:34:50,450][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:34:51,909][__main__][INFO] - Iteration 753 took 54s (11.77% Gen, 85.53% Train). Generation: 6s, Training: 46s. Estimated remaining time: 3h 34m 26s. Estimated total time: 15h 3m 12s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 19s, 500 more iterations: 7h 31m 36s. [2026-03-26 01:34:51,911][__main__][INFO] - Starting iteration 753. [2026-03-26 01:34:51,916][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:34:51,916][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:34:58,817][__main__][INFO] - Number of regex retries in iteration 753: 0 [2026-03-26 01:34:58,819][__main__][INFO] - agents played in iteration 753 are Bob, Alice [2026-03-26 01:35:00,166][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:35:00,228][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:35:00,229][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:35:00,229][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:35:00,896][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:35:01,504][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:35:02,162][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:35:02,818][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:35:03,475][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:35:04,132][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:35:04,789][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:35:05,447][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:35:06,104][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:35:06,760][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:35:07,418][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:35:08,074][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:35:08,731][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:35:09,388][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:35:10,046][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:35:10,704][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:35:11,361][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:35:12,019][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:35:13,310][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:35:13,967][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:35:14,624][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:35:15,281][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:35:15,938][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:35:16,595][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:35:17,252][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:35:17,909][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:35:18,566][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:35:19,224][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:35:19,881][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:35:20,538][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:35:21,197][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:35:21,855][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:35:22,513][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:35:23,169][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:35:23,827][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:35:24,484][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:35:25,141][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:35:25,798][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:35:26,456][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:35:27,113][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:35:27,770][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:35:28,427][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:35:29,085][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:35:29,742][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:35:30,399][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:35:31,056][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:35:31,714][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:35:32,371][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:35:33,298][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:35:33,957][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:35:34,614][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:35:35,272][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:35:35,930][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:35:36,587][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:35:37,244][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:35:37,902][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:35:38,559][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:35:39,217][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:35:39,875][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:35:40,532][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:35:41,190][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:35:41,847][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:35:42,505][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:35:43,162][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:35:43,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:35:44,548][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:35:45,883][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:35:45,886][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:35:45,887][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:35:47,183][__main__][INFO] - Iteration 754 took 55s (12.49% Gen, 85.16% Train). Generation: 6s, Training: 47s. Estimated remaining time: 3h 51m 27s. Estimated total time: 15h 21m 8s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 6s, 500 more iterations: 7h 40m 34s. [2026-03-26 01:35:47,185][__main__][INFO] - Starting iteration 754. [2026-03-26 01:35:47,189][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:35:47,190][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:35:52,125][__main__][INFO] - Number of regex retries in iteration 754: 0 [2026-03-26 01:35:52,126][__main__][INFO] - agents played in iteration 754 are Bob, Alice [2026-03-26 01:35:52,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:35:52,862][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:35:52,863][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:35:52,864][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:35:53,533][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:35:54,149][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:35:54,807][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:35:55,465][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:35:56,122][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:35:56,779][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:35:57,437][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:35:58,093][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:35:58,751][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:35:59,408][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:36:00,065][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:36:00,722][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:36:01,380][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:36:02,037][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:36:02,694][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:36:03,351][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:36:04,008][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:36:04,665][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:36:05,322][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:36:05,980][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:36:06,637][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:36:07,299][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:36:07,957][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:36:08,615][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:36:09,274][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:36:09,932][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:36:10,591][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:36:11,249][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:36:11,908][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:36:12,566][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:36:13,224][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:36:13,883][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:36:14,542][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:36:15,200][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:36:15,858][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:36:16,517][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:36:17,175][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:36:17,834][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:36:18,492][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:36:19,151][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:36:19,811][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:36:20,470][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:36:21,129][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:36:21,788][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:36:22,447][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:36:23,105][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:36:23,764][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:36:24,423][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:36:25,338][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:36:25,999][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:36:26,658][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:36:27,317][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:36:27,976][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:36:28,635][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:36:29,294][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:36:29,953][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:36:30,611][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:36:31,270][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:36:31,928][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:36:32,587][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:36:33,246][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:36:33,905][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:36:34,563][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:36:35,222][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:36:35,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:36:36,625][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:36:37,991][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:36:37,994][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:36:37,995][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:36:39,371][__main__][INFO] - Iteration 755 took 52s (9.46% Gen, 87.90% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 59m 9s. Estimated total time: 14h 29m 43s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 58s, 500 more iterations: 7h 14m 51s. [2026-03-26 01:36:39,374][__main__][INFO] - Starting iteration 755. [2026-03-26 01:36:39,377][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:36:39,378][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:36:44,949][__main__][INFO] - Number of regex retries in iteration 755: 0 [2026-03-26 01:36:44,951][__main__][INFO] - agents played in iteration 755 are Bob, Alice [2026-03-26 01:36:45,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:36:45,600][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:36:45,601][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:36:45,601][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:36:46,277][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:36:46,890][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:36:47,548][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:36:48,204][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:36:48,862][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:36:49,519][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:36:50,175][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:36:50,833][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:36:51,490][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:36:52,147][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:36:52,805][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:36:53,461][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:36:54,119][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:36:54,776][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:36:55,433][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:36:56,090][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:36:56,747][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:36:57,404][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:36:58,061][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:36:58,719][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:36:59,376][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:37:00,033][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:37:00,691][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:37:01,348][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:37:02,005][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:37:02,662][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:37:03,320][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:37:04,981][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:37:05,638][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:37:06,296][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:37:06,953][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:37:07,611][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:37:08,269][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:37:08,926][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:37:09,583][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:37:10,241][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:37:10,899][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:37:11,557][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:37:12,215][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:37:12,872][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:37:13,530][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:37:14,188][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:37:14,846][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:37:15,503][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:37:16,161][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:37:16,818][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:37:17,475][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:37:18,133][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:37:19,056][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:37:19,714][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:37:20,372][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:37:21,029][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:37:21,687][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:37:22,345][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:37:23,002][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:37:23,659][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:37:24,317][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:37:24,975][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:37:25,632][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:37:26,289][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:37:26,947][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:37:27,605][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:37:28,262][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:37:28,920][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:37:29,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:37:30,288][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-26 01:37:31,652][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:37:31,655][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:37:31,657][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:37:33,282][__main__][INFO] - Iteration 756 took 53s (10.34% Gen, 86.64% Train). Generation: 5s, Training: 46s. Estimated remaining time: 3h 26m 58s. Estimated total time: 14h 58m 26s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 50s, 500 more iterations: 7h 29m 13s. [2026-03-26 01:37:33,285][__main__][INFO] - Starting iteration 756. [2026-03-26 01:37:33,292][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:37:33,292][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:37:38,526][__main__][INFO] - Number of regex retries in iteration 756: 0 [2026-03-26 01:37:38,527][__main__][INFO] - agents played in iteration 756 are Bob, Alice [2026-03-26 01:37:39,374][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:37:39,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:37:39,444][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:37:39,445][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:37:40,189][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:37:41,772][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:37:42,430][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:37:43,087][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:37:43,744][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:37:44,401][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:37:45,058][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:37:45,715][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:37:46,372][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:37:47,029][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:37:47,686][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:37:48,343][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:37:49,001][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:37:49,658][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:37:50,316][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:37:50,974][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:37:51,633][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:37:52,291][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:37:52,948][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:37:53,606][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:37:54,264][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:37:54,923][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:37:55,581][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:37:56,239][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:37:56,898][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:37:57,555][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:37:58,213][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:37:58,871][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:37:59,528][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:38:00,186][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:38:02,158][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:38:02,816][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:38:03,473][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:38:04,130][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:38:04,787][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:38:05,445][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:38:06,102][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:38:06,759][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:38:07,417][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:38:08,074][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:38:08,731][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:38:09,389][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:38:10,047][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:38:10,704][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:38:11,363][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:38:12,020][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:38:12,678][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:38:13,335][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:38:14,331][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:38:14,989][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:38:15,647][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:38:16,304][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:38:16,962][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:38:17,620][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:38:18,277][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:38:18,935][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:38:19,593][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:38:20,250][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:38:20,908][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:38:21,568][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:38:22,226][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:38:22,884][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:38:23,542][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:38:24,200][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:38:24,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:38:25,623][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:45 [2026-03-26 01:38:26,974][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:38:26,977][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:38:26,978][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:38:28,384][__main__][INFO] - Iteration 757 took 55s (9.50% Gen, 87.94% Train). Generation: 5s, Training: 48s. Estimated remaining time: 3h 45m 50s. Estimated total time: 15h 18m 13s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 49s, 500 more iterations: 7h 39m 6s. [2026-03-26 01:38:28,387][__main__][INFO] - Starting iteration 757. [2026-03-26 01:38:28,391][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:38:28,392][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:38:33,327][__main__][INFO] - Number of regex retries in iteration 757: 0 [2026-03-26 01:38:33,329][__main__][INFO] - agents played in iteration 757 are Bob, Alice [2026-03-26 01:38:33,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:38:34,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:38:34,021][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:38:34,022][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:38:34,692][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:38:35,333][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:38:35,992][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:38:36,649][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:38:37,307][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:38:37,965][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:38:38,622][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:38:39,279][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:38:39,938][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:38:40,595][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:38:41,253][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:38:41,910][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:38:42,568][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:38:43,225][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:38:43,883][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:38:44,541][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:38:45,198][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:38:45,856][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:38:46,513][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:38:47,170][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:38:47,828][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:38:48,486][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:38:49,143][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:38:49,801][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:38:50,458][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:38:51,115][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:38:51,773][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:38:52,431][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:38:53,088][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:38:53,746][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:38:54,404][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:38:55,061][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:38:55,719][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:38:56,376][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:38:57,035][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:38:57,692][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:38:58,349][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:38:59,007][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:38:59,664][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:39:00,322][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:39:00,979][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:39:01,636][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:39:02,294][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:39:02,951][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:39:03,609][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:39:04,267][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:39:04,924][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:39:05,583][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:39:06,539][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:39:07,198][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:39:07,856][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:39:08,513][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:39:09,170][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:39:09,828][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:39:10,486][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:39:11,143][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:39:11,801][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:39:12,458][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:39:13,116][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:39:13,774][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:39:14,431][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:39:15,088][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:39:15,746][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:39:16,403][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:39:17,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:39:17,763][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:39:19,075][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:39:19,078][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:39:19,079][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:39:20,714][__main__][INFO] - Iteration 758 took 52s (9.43% Gen, 87.43% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 58m 50s. Estimated total time: 14h 32m 5s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 12s, 500 more iterations: 7h 16m 2s. [2026-03-26 01:39:20,717][__main__][INFO] - Starting iteration 758. [2026-03-26 01:39:20,722][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:39:20,722][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:39:27,211][__main__][INFO] - Number of regex retries in iteration 758: 0 [2026-03-26 01:39:27,213][__main__][INFO] - agents played in iteration 758 are Bob, Alice [2026-03-26 01:39:28,552][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:39:28,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:39:28,616][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:39:28,616][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:39:29,289][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:39:29,905][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:39:30,565][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:39:31,222][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:39:31,881][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:39:32,539][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:39:33,197][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:39:33,857][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:39:34,515][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:39:35,173][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:39:35,831][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:39:36,489][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:39:37,147][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:39:37,805][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:39:38,462][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:39:39,120][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:39:39,779][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:39:40,437][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:39:41,095][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:39:41,753][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:39:42,411][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:39:43,069][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:39:43,728][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:39:44,386][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:39:45,044][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:39:45,702][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:39:46,360][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:39:47,019][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:39:47,677][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:39:48,337][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:39:48,996][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:39:49,655][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:39:50,315][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:39:50,974][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:39:51,634][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:39:52,294][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:39:52,953][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:39:53,613][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:39:54,272][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:39:54,932][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:39:55,593][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:39:56,252][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:39:56,912][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:39:57,572][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:39:58,231][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:39:58,894][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:39:59,554][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:40:00,213][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:40:01,186][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:40:01,847][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:40:02,506][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:40:03,164][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:40:03,823][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:40:04,482][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:40:05,140][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:40:05,800][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:40:06,458][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:40:07,118][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:40:07,776][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:40:08,436][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:40:09,095][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:40:09,755][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:40:10,414][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:40:11,073][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:40:11,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:40:12,458][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:40:13,792][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:40:13,795][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:40:13,796][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:40:15,151][__main__][INFO] - Iteration 759 took 54s (11.92% Gen, 85.58% Train). Generation: 6s, Training: 46s. Estimated remaining time: 3h 33m 1s. Estimated total time: 15h 7m 11s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 43s, 500 more iterations: 7h 33m 35s. [2026-03-26 01:40:15,154][__main__][INFO] - Starting iteration 759. [2026-03-26 01:40:15,160][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:40:15,161][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:40:21,200][__main__][INFO] - Number of regex retries in iteration 759: 0 [2026-03-26 01:40:21,201][__main__][INFO] - agents played in iteration 759 are Bob, Alice [2026-03-26 01:40:22,416][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:40:22,478][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:40:22,479][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:40:22,480][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:40:23,162][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:40:23,780][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:40:24,441][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:40:25,100][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:40:25,760][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:40:26,420][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:40:27,079][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:40:27,738][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:40:28,398][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:40:29,059][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:40:29,719][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:40:30,379][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:40:31,038][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:40:31,698][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:40:32,359][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:40:33,017][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:40:33,676][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:40:34,335][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:40:34,994][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:40:35,654][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:40:36,314][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:40:36,973][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:40:37,633][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:40:38,292][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:40:38,952][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:40:39,611][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:40:40,271][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:40:40,930][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:40:41,590][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:40:42,249][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:40:42,908][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:40:43,568][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:40:44,227][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:40:44,887][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:40:45,546][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:40:46,205][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:40:46,865][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:40:47,524][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:40:48,183][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:40:48,843][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:40:49,502][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:40:50,162][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:40:50,821][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:40:51,481][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:40:52,140][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:40:52,800][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:40:53,459][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:40:54,119][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:40:55,025][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:40:55,687][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:40:56,344][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:40:57,003][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:40:57,661][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:40:58,321][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:40:58,982][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:40:59,642][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:41:00,302][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:41:00,961][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:41:01,620][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:41:02,279][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:41:02,938][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:41:03,596][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:41:04,255][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:41:04,914][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:41:05,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:41:06,308][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:41:07,708][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:41:07,711][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:41:07,713][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:41:09,104][__main__][INFO] - Iteration 760 took 53s (11.20% Gen, 86.22% Train). Generation: 6s, Training: 46s. Estimated remaining time: 3h 24m 2s. Estimated total time: 14h 59m 6s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 54s, 500 more iterations: 7h 29m 33s. [2026-03-26 01:41:09,106][__main__][INFO] - Starting iteration 760. [2026-03-26 01:41:09,112][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:41:09,113][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:41:14,049][__main__][INFO] - Number of regex retries in iteration 760: 0 [2026-03-26 01:41:14,050][__main__][INFO] - agents played in iteration 760 are Bob, Alice [2026-03-26 01:41:14,662][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:41:14,724][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:41:14,725][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:41:14,725][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:41:15,402][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:41:16,022][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:41:16,681][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:41:17,339][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:41:17,997][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:41:18,656][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:41:19,314][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:41:19,972][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:41:20,630][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:41:21,289][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:41:21,947][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:41:22,605][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:41:23,264][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:41:23,923][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:41:24,581][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:41:25,240][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:41:25,898][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:41:26,557][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:41:27,215][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:41:27,874][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:41:28,534][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:41:29,194][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:41:29,853][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:41:30,512][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:41:31,171][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:41:31,829][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:41:32,488][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:41:33,146][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:41:33,805][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:41:34,463][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:41:35,121][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:41:35,780][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:41:36,439][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:41:37,098][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:41:37,756][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:41:38,415][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:41:39,074][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:41:39,732][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:41:40,392][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:41:41,051][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:41:41,709][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:41:42,368][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:41:43,026][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:41:43,685][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:41:44,343][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:41:45,003][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:41:45,661][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:41:46,320][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:41:47,279][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:41:47,939][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:41:48,598][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:41:49,257][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:41:49,915][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:41:50,573][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:41:51,232][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:41:51,892][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:41:52,653][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:41:53,317][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:41:53,976][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:41:54,635][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:41:55,293][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:41:55,952][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:41:56,611][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:41:57,272][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:41:57,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:41:58,679][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:42:00,120][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:42:00,123][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:42:00,124][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:42:01,433][__main__][INFO] - Iteration 761 took 52s (9.44% Gen, 88.06% Train). Generation: 4s, Training: 46s. Estimated remaining time: 2h 56m 6s. Estimated total time: 14h 32m 2s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 12s, 500 more iterations: 7h 16m 1s. [2026-03-26 01:42:01,436][__main__][INFO] - Starting iteration 761. [2026-03-26 01:42:01,447][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:42:01,448][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:42:06,385][__main__][INFO] - Number of regex retries in iteration 761: 0 [2026-03-26 01:42:06,386][__main__][INFO] - agents played in iteration 761 are Bob, Alice [2026-03-26 01:42:06,906][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:42:06,968][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:42:06,968][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:42:06,969][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:42:07,658][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:42:08,283][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:42:08,935][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:42:09,593][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:42:10,252][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:42:10,910][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:42:11,568][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:42:12,227][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:42:12,885][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:42:13,543][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:42:14,202][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:42:14,860][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:42:15,519][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:42:16,178][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:42:16,836][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:42:17,494][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:42:18,153][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:42:18,811][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:42:19,470][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:42:20,128][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:42:20,786][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:42:21,445][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:42:22,104][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:42:22,763][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:42:23,422][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:42:24,080][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:42:24,740][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:42:25,399][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:42:26,057][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:42:26,716][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:42:27,375][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:42:28,033][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:42:28,693][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:42:29,352][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:42:30,011][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:42:30,669][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:42:31,327][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:42:31,986][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:42:32,645][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:42:33,303][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:42:33,962][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:42:34,621][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:42:35,280][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:42:35,940][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:42:36,597][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:42:37,256][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:42:37,915][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:42:38,573][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:42:39,483][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:42:40,144][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:42:40,803][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:42:41,462][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:42:42,120][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:42:42,779][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:42:43,439][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:42:44,098][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:42:44,757][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:42:45,415][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:42:46,074][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:42:46,733][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:42:47,392][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:42:48,050][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:42:48,709][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:42:49,368][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:42:50,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:42:50,736][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:42:52,054][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:42:52,057][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:42:52,058][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:42:53,499][__main__][INFO] - Iteration 762 took 52s (9.49% Gen, 87.75% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 50m 45s. Estimated total time: 14h 27m 33s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 45s, 500 more iterations: 7h 13m 46s. [2026-03-26 01:42:53,502][__main__][INFO] - Starting iteration 762. [2026-03-26 01:42:53,505][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:42:53,506][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:42:58,424][__main__][INFO] - Number of regex retries in iteration 762: 0 [2026-03-26 01:42:58,425][__main__][INFO] - agents played in iteration 762 are Bob, Alice [2026-03-26 01:42:59,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:42:59,097][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:42:59,098][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:42:59,099][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:42:59,773][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:43:00,443][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:43:01,054][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:43:01,712][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:43:02,371][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:43:03,029][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:43:03,687][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:43:04,345][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:43:05,003][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:43:05,662][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:43:06,320][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:43:06,978][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:43:07,637][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:43:08,295][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:43:08,954][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:43:09,613][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:43:10,272][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:43:10,930][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:43:11,589][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:43:12,248][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:43:12,907][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:43:13,565][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:43:14,224][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:43:14,883][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:43:15,542][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:43:16,201][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:43:16,860][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:43:17,519][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:43:18,177][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:43:18,835][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:43:19,494][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:43:20,152][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:43:20,811][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:43:21,471][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:43:22,129][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:43:22,788][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:43:23,446][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:43:24,105][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:43:24,764][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:43:25,422][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:43:26,081][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:43:26,739][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:43:27,398][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:43:28,056][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:43:28,715][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:43:29,374][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:43:30,032][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:43:30,691][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:43:31,672][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:43:32,332][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:43:32,991][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:43:33,650][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:43:34,308][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:43:34,966][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:43:35,625][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:43:36,284][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:43:36,943][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:43:37,602][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:43:38,260][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:43:38,919][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:43:39,577][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:43:40,237][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:43:40,895][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:43:41,554][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:43:42,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:43:42,969][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:43:44,331][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:43:44,334][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:43:44,335][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:43:45,958][__main__][INFO] - Iteration 763 took 52s (9.38% Gen, 87.52% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 56m 33s. Estimated total time: 14h 34m 14s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 25s, 500 more iterations: 7h 17m 7s. [2026-03-26 01:43:45,961][__main__][INFO] - Starting iteration 763. [2026-03-26 01:43:45,965][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:43:45,966][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:43:50,950][__main__][INFO] - Number of regex retries in iteration 763: 0 [2026-03-26 01:43:50,952][__main__][INFO] - agents played in iteration 763 are Bob, Alice [2026-03-26 01:43:51,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:43:51,495][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:43:51,496][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:43:51,496][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:43:52,166][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:43:52,786][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:43:53,442][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:43:54,100][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:43:54,759][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:43:55,417][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:43:56,075][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:43:56,734][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:43:57,392][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:43:58,051][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:43:58,710][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:43:59,369][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:44:00,027][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:44:00,686][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:44:01,345][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:44:02,003][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:44:02,662][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:44:03,320][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:44:03,979][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:44:04,637][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:44:05,296][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:44:05,954][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:44:06,613][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:44:07,272][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:44:07,931][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:44:08,589][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:44:09,247][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:44:09,906][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:44:10,565][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:44:11,223][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:44:11,881][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:44:12,540][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:44:13,199][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:44:13,858][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:44:14,516][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:44:15,175][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:44:15,833][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:44:16,492][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:44:17,150][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:44:17,809][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:44:18,468][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:44:19,126][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:44:19,784][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:44:20,442][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:44:21,103][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:44:21,762][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:44:22,421][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:44:23,079][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:44:23,983][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:44:25,761][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:44:26,419][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:44:27,077][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:44:27,736][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:44:28,395][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:44:29,056][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:44:29,715][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:44:30,373][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:44:31,032][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:44:31,690][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:44:32,349][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:44:33,008][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:44:33,666][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:44:34,325][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:44:34,983][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:44:35,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:44:36,351][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-26 01:44:37,693][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:44:37,696][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:44:37,697][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:44:39,336][__main__][INFO] - Iteration 764 took 53s (9.34% Gen, 87.58% Train). Generation: 4s, Training: 46s. Estimated remaining time: 3h 10m 58s. Estimated total time: 14h 49m 32s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 57s, 500 more iterations: 7h 24m 46s. [2026-03-26 01:44:39,338][__main__][INFO] - Starting iteration 764. [2026-03-26 01:44:39,343][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:44:39,344][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:44:44,144][__main__][INFO] - Number of regex retries in iteration 764: 0 [2026-03-26 01:44:44,145][__main__][INFO] - agents played in iteration 764 are Bob, Alice [2026-03-26 01:44:45,351][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:44:45,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:44:45,412][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:44:45,413][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:44:46,109][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:44:46,724][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:44:47,384][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:44:48,042][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:44:48,701][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:44:49,360][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:44:50,018][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:44:50,676][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:44:51,334][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:44:51,992][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:44:52,651][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:44:53,309][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:44:53,968][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:44:54,626][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:44:55,285][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:44:55,943][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:44:56,601][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:44:57,259][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:44:57,918][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:44:58,578][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:44:59,239][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:44:59,898][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:45:00,556][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:45:01,215][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:45:01,874][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:45:02,532][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:45:03,190][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:45:03,848][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:45:04,507][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:45:05,166][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:45:05,824][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:45:06,483][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:45:07,142][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:45:07,800][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:45:08,459][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:45:09,117][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:45:09,776][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:45:10,435][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:45:11,093][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:45:11,752][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:45:12,410][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:45:13,070][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:45:13,729][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:45:14,388][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:45:15,047][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:45:15,705][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:45:16,365][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:45:17,023][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:45:17,977][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:45:18,638][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:45:19,297][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:45:19,955][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:45:20,615][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:45:21,273][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:45:21,932][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:45:22,591][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:45:23,249][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:45:23,908][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:45:24,567][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:45:25,226][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:45:25,885][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:45:26,544][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:45:27,203][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:45:27,862][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:45:28,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:45:29,248][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:45:30,557][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:45:30,560][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:45:30,561][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:45:32,180][__main__][INFO] - Iteration 765 took 52s (9.08% Gen, 87.84% Train). Generation: 4s, Training: 46s. Estimated remaining time: 3h 1m 12s. Estimated total time: 14h 40m 39s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 3s, 500 more iterations: 7h 20m 19s. [2026-03-26 01:45:32,183][__main__][INFO] - Starting iteration 765. [2026-03-26 01:45:32,189][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:45:32,189][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:45:38,225][__main__][INFO] - Number of regex retries in iteration 765: 0 [2026-03-26 01:45:38,226][__main__][INFO] - agents played in iteration 765 are Bob, Alice [2026-03-26 01:45:39,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:45:39,432][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:45:39,433][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:45:39,434][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:45:40,100][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:45:40,711][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:45:41,372][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:45:42,030][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:45:42,688][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:45:43,347][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:45:44,005][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:45:44,663][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:45:45,322][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:45:45,979][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:45:46,638][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:45:47,297][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:45:47,955][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:45:48,614][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:45:49,272][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:45:49,931][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:45:50,589][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:45:51,247][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:45:51,906][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:45:52,564][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:45:53,223][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:45:53,881][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:45:54,539][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:45:55,198][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:45:55,857][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:45:56,515][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:45:57,174][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:45:57,832][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:45:58,490][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:45:59,148][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:45:59,806][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:46:00,463][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:46:01,120][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:46:01,777][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:46:02,435][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:46:03,092][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:46:03,750][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:46:04,407][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:46:05,064][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:46:05,722][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:46:06,379][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:46:07,037][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:46:07,695][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:46:08,352][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:46:09,009][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:46:09,667][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:46:10,325][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:46:10,983][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:46:11,878][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:46:12,536][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:46:13,194][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:46:13,852][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:46:14,509][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:46:15,166][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:46:15,824][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:46:16,483][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:46:17,139][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:46:17,796][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:46:18,453][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:46:19,111][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:46:19,769][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:46:20,426][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:46:21,084][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:46:21,742][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:46:22,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:46:23,098][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:46:24,536][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:46:24,539][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:46:24,560][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:46:26,243][__main__][INFO] - Iteration 766 took 54s (11.17% Gen, 85.71% Train). Generation: 6s, Training: 46s. Estimated remaining time: 3h 20m 35s. Estimated total time: 15h 0m 56s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 5s, 500 more iterations: 7h 30m 28s. [2026-03-26 01:46:26,245][__main__][INFO] - Starting iteration 766. [2026-03-26 01:46:26,251][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:46:26,252][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:46:32,269][__main__][INFO] - Number of regex retries in iteration 766: 0 [2026-03-26 01:46:32,270][__main__][INFO] - agents played in iteration 766 are Bob, Alice [2026-03-26 01:46:32,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:46:32,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:46:32,906][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:46:32,907][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:46:33,582][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:46:34,192][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:46:34,851][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:46:35,508][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:46:36,165][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:46:36,822][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:46:37,478][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:46:38,136][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:46:38,793][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:46:39,450][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:46:40,107][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:46:40,765][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:46:41,423][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:46:42,080][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:46:42,738][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:46:43,394][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:46:44,052][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:46:44,709][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:46:45,366][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:46:46,023][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:46:46,680][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:46:47,337][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:46:47,995][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:46:48,652][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:46:49,309][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:46:49,966][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:46:50,623][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:46:51,280][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:46:51,938][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:46:52,595][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:46:53,252][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:46:53,909][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:46:54,567][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:46:55,224][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:46:55,881][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:46:56,538][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:46:57,195][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:46:57,853][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:46:58,510][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:46:59,168][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:46:59,825][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:47:00,482][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:47:01,139][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:47:01,796][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:47:02,453][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:47:03,111][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:47:03,768][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:47:04,425][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:47:05,377][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:47:06,035][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:47:06,692][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:47:07,349][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:47:08,007][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:47:08,665][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:47:09,323][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:47:09,980][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:47:10,639][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:47:11,297][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:47:11,955][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:47:12,613][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:47:13,270][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:47:13,927][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:47:14,585][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:47:15,242][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:47:15,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:47:16,613][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:47:17,947][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:47:17,950][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:47:17,951][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:47:19,277][__main__][INFO] - Iteration 767 took 53s (11.35% Gen, 86.14% Train). Generation: 6s, Training: 45s. Estimated remaining time: 3h 2m 34s. Estimated total time: 14h 43m 48s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 22s, 500 more iterations: 7h 21m 54s. [2026-03-26 01:47:19,280][__main__][INFO] - Starting iteration 767. [2026-03-26 01:47:19,283][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:47:19,284][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:47:24,066][__main__][INFO] - Number of regex retries in iteration 767: 0 [2026-03-26 01:47:24,067][__main__][INFO] - agents played in iteration 767 are Bob, Alice [2026-03-26 01:47:24,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:47:24,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:47:24,766][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:47:24,766][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:47:25,437][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:47:26,051][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:47:26,710][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:47:27,367][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:47:28,024][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:47:28,682][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:47:29,339][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:47:29,996][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:47:30,654][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:47:31,311][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:47:31,968][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:47:32,625][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:47:33,282][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:47:33,939][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:47:34,596][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:47:35,254][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:47:35,911][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:47:36,568][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:47:37,226][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:47:37,883][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:47:38,541][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:47:39,198][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:47:39,856][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:47:40,513][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:47:41,170][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:47:41,828][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:47:42,485][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:47:43,142][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:47:43,800][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:47:44,457][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:47:45,114][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:47:45,772][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:47:46,429][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:47:47,087][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:47:47,744][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:47:48,402][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:47:49,063][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:47:49,717][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:47:50,375][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:47:51,033][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:47:51,690][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:47:52,349][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:47:53,006][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:47:53,664][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:47:54,321][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:47:54,979][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:47:55,636][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:47:56,294][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:47:57,182][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:47:57,842][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:47:58,500][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:48:00,237][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:48:00,894][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:48:01,552][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:48:02,209][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:48:02,866][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:48:03,524][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:48:04,181][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:48:04,839][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:48:05,496][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:48:06,154][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:48:06,811][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:48:07,469][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:48:08,126][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:48:08,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:48:09,476][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:44 [2026-03-26 01:48:10,947][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:48:10,950][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:48:10,951][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:48:12,324][__main__][INFO] - Iteration 768 took 53s (9.02% Gen, 88.39% Train). Generation: 4s, Training: 46s. Estimated remaining time: 3h 1m 55s. Estimated total time: 14h 44m 2s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 24s, 500 more iterations: 7h 22m 1s. [2026-03-26 01:48:12,328][__main__][INFO] - Starting iteration 768. [2026-03-26 01:48:12,331][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:48:12,332][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:48:17,537][__main__][INFO] - Number of regex retries in iteration 768: 0 [2026-03-26 01:48:17,539][__main__][INFO] - agents played in iteration 768 are Bob, Alice [2026-03-26 01:48:18,036][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:48:18,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:48:18,099][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:48:18,099][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:48:18,765][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:48:19,377][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:48:20,037][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:48:20,695][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:48:21,353][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:48:22,011][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:48:22,669][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:48:23,327][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:48:23,986][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:48:24,643][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:48:25,301][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:48:25,959][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:48:26,617][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:48:27,275][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:48:27,933][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:48:28,593][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:48:29,251][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:48:29,909][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:48:30,567][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:48:31,225][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:48:31,884][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:48:32,543][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:48:33,201][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:48:33,859][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:48:34,517][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:48:35,176][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:48:35,836][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:48:36,494][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:48:37,153][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:48:37,811][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:48:38,470][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:48:39,129][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:48:39,788][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:48:40,447][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:48:41,106][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:48:41,764][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:48:42,423][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:48:43,082][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:48:43,741][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:48:44,399][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:48:45,058][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:48:45,716][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:48:46,374][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:48:47,032][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:48:47,691][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:48:48,349][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:48:49,008][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:48:49,666][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:48:50,616][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:48:51,274][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:48:51,932][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:48:52,589][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:48:53,246][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:48:53,904][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:48:54,562][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:48:55,219][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:48:55,876][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:48:56,534][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:48:57,191][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:48:57,848][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:48:58,507][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:48:59,164][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:48:59,822][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:49:00,479][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:49:01,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:49:01,860][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:49:03,280][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:49:03,285][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:49:03,287][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:49:05,337][__main__][INFO] - Iteration 769 took 53s (9.82% Gen, 86.31% Train). Generation: 5s, Training: 45s. Estimated remaining time: 3h 0m 26s. Estimated total time: 14h 43m 26s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 20s, 500 more iterations: 7h 21m 43s. [2026-03-26 01:49:05,340][__main__][INFO] - Starting iteration 769. [2026-03-26 01:49:05,344][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:49:05,344][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:49:10,394][__main__][INFO] - Number of regex retries in iteration 769: 0 [2026-03-26 01:49:10,395][__main__][INFO] - agents played in iteration 769 are Bob, Alice [2026-03-26 01:49:11,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:49:11,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:49:11,095][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:49:11,096][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:49:11,764][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:49:12,376][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:49:13,035][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:49:13,692][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:49:14,349][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:49:15,007][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:49:15,665][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:49:16,321][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:49:16,978][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:49:17,635][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:49:18,292][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:49:18,950][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:49:19,607][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:49:20,265][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:49:20,923][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:49:21,581][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:49:22,238][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:49:22,897][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:49:23,554][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:49:24,211][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:49:24,868][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:49:25,525][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:49:26,182][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:49:26,839][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:49:27,497][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:49:28,154][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:49:28,812][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:49:29,470][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:49:30,127][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:49:30,784][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:49:31,442][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:49:32,099][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:49:32,757][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:49:33,414][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:49:34,071][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:49:34,729][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:49:35,387][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:49:36,044][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:49:36,701][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:49:37,359][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:49:38,016][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:49:38,674][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:49:39,332][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:49:39,990][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:49:40,647][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:49:41,304][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:49:41,962][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:49:42,619][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:49:43,522][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:49:44,180][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:49:44,837][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:49:45,495][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:49:46,152][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:49:46,809][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:49:47,467][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:49:48,124][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:49:48,782][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:49:49,439][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:49:50,097][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:49:50,754][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:49:51,412][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:49:52,070][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:49:52,727][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:49:53,385][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:49:54,042][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:49:54,748][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:49:56,140][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:49:56,143][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:49:56,144][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:49:57,489][__main__][INFO] - Iteration 770 took 52s (9.69% Gen, 87.73% Train). Generation: 5s, Training: 45s. Estimated remaining time: 2h 45m 15s. Estimated total time: 14h 29m 7s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 54s, 500 more iterations: 7h 14m 33s. [2026-03-26 01:49:57,492][__main__][INFO] - Starting iteration 770. [2026-03-26 01:49:57,496][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:49:57,496][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:50:02,298][__main__][INFO] - Number of regex retries in iteration 770: 0 [2026-03-26 01:50:02,299][__main__][INFO] - agents played in iteration 770 are Bob, Alice [2026-03-26 01:50:02,780][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:50:02,841][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:50:02,842][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:50:02,843][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:50:03,509][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:50:04,128][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:50:04,787][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:50:05,445][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:50:06,104][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:50:06,763][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:50:07,421][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:50:08,080][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:50:08,738][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:50:09,397][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:50:10,055][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:50:10,713][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:50:11,372][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:50:12,030][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:50:12,690][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:50:13,348][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:50:14,006][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:50:14,664][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:50:15,322][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:50:15,981][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:50:16,639][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:50:17,296][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:50:17,955][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:50:18,613][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:50:19,271][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:50:19,929][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:50:20,587][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:50:21,245][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:50:21,903][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:50:22,562][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:50:23,221][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:50:23,879][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:50:24,538][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:50:25,196][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:50:25,854][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:50:26,512][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:50:27,171][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:50:27,829][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:50:28,488][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:50:29,146][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:50:29,804][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:50:30,462][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:50:31,121][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:50:31,779][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:50:32,437][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:50:33,095][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:50:33,754][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:50:34,412][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:50:35,366][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:50:36,024][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:50:36,681][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:50:37,339][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:50:37,996][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:50:38,654][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:50:39,311][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:50:39,970][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:50:40,627][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:50:41,284][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:50:41,942][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:50:42,599][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:50:43,257][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:50:43,915][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:50:44,572][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:50:45,230][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:50:45,887][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:50:46,612][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:50:47,961][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:50:47,964][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:50:47,966][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:50:49,287][__main__][INFO] - Iteration 771 took 51s (9.27% Gen, 88.17% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 38m 28s. Estimated total time: 14h 23m 12s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 19s, 500 more iterations: 7h 11m 36s. [2026-03-26 01:50:49,289][__main__][INFO] - Starting iteration 771. [2026-03-26 01:50:49,293][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:50:49,294][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:50:55,343][__main__][INFO] - Number of regex retries in iteration 771: 0 [2026-03-26 01:50:55,345][__main__][INFO] - agents played in iteration 771 are Bob, Alice [2026-03-26 01:50:56,518][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:50:56,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:50:56,579][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:50:56,580][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:50:57,248][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:50:57,859][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:50:58,517][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:50:59,174][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:50:59,831][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:51:00,488][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:51:01,146][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:51:01,802][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:51:02,459][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:51:03,116][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:51:03,773][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:51:04,431][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:51:05,088][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:51:05,745][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:51:06,403][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:51:07,060][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:51:07,717][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:51:08,375][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:51:09,032][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:51:09,690][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:51:10,348][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:51:11,005][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:51:11,662][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:51:12,320][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:51:12,977][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:51:13,634][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:51:14,292][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:51:14,949][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:51:15,606][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:51:16,264][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:51:16,921][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:51:17,578][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:51:18,236][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:51:18,893][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:51:19,550][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:51:20,208][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:51:20,867][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:51:21,523][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:51:22,181][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:51:22,839][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:51:23,496][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:51:24,154][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:51:24,811][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:51:25,469][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:51:26,126][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:51:26,784][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:51:27,441][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:51:28,099][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:51:28,995][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:51:29,653][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:51:30,311][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:51:30,968][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:51:31,625][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:51:32,282][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:51:32,940][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:51:33,597][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:51:34,254][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:51:34,911][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:51:35,569][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:51:36,227][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:51:36,884][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:51:37,542][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:51:38,199][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:51:38,856][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:51:39,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:51:40,207][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:51:41,546][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:51:41,549][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:51:41,551][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:51:44,691][__main__][INFO] - Iteration 772 took 55s (10.92% Gen, 83.40% Train). Generation: 6s, Training: 46s. Estimated remaining time: 3h 37m 40s. Estimated total time: 15h 23m 19s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 19s, 500 more iterations: 7h 41m 39s. [2026-03-26 01:51:44,693][__main__][INFO] - Starting iteration 772. [2026-03-26 01:51:44,698][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:51:44,699][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:51:51,049][__main__][INFO] - Number of regex retries in iteration 772: 0 [2026-03-26 01:51:51,051][__main__][INFO] - agents played in iteration 772 are Bob, Alice [2026-03-26 01:51:52,248][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:51:52,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:51:52,311][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:51:52,312][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:51:52,997][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:51:53,601][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:51:54,261][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:51:54,918][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:51:55,576][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:51:56,233][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:51:56,892][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:51:57,549][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:51:58,207][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:51:58,865][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:51:59,524][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:52:00,181][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:52:00,839][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:52:01,498][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:52:02,156][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:52:02,813][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:52:03,471][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:52:04,129][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:52:04,787][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:52:05,445][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:52:06,104][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:52:06,762][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:52:07,420][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:52:08,078][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:52:08,736][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:52:09,394][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:52:10,053][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:52:10,712][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:52:11,370][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:52:12,028][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:52:12,686][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:52:13,344][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:52:14,002][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:52:14,661][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:52:15,319][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:52:15,977][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:52:16,635][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:52:17,293][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:52:17,951][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:52:18,609][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:52:19,268][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:52:19,926][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:52:20,584][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:52:21,242][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:52:21,901][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:52:22,559][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:52:23,217][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:52:23,876][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:52:24,823][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:52:25,481][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:52:26,138][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:52:26,796][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:52:27,453][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:52:28,110][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:52:28,769][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:52:29,426][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:52:30,084][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:52:30,742][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:52:31,400][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:52:32,057][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:52:32,714][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:52:33,371][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:52:34,029][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:52:34,686][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:52:35,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:52:36,071][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:52:37,421][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:52:37,424][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:52:37,425][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:52:38,957][__main__][INFO] - Iteration 773 took 54s (11.71% Gen, 85.47% Train). Generation: 6s, Training: 46s. Estimated remaining time: 3h 17m 46s. Estimated total time: 15h 4m 20s. Time estimates for 10 more iterations: 9m 2s, 100 more iterations: 1h 30m 26s, 500 more iterations: 7h 32m 10s. [2026-03-26 01:52:38,961][__main__][INFO] - Starting iteration 773. [2026-03-26 01:52:38,965][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:52:38,966][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:52:45,500][__main__][INFO] - Number of regex retries in iteration 773: 0 [2026-03-26 01:52:45,501][__main__][INFO] - agents played in iteration 773 are Bob, Alice [2026-03-26 01:52:46,126][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:52:46,189][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:52:46,189][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:52:46,190][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:52:46,872][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:52:47,484][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:52:48,143][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:52:48,799][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:52:49,456][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:52:50,113][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:52:50,769][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:52:51,426][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:52:52,083][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:52:52,740][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:52:53,397][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:52:54,054][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:52:54,711][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:52:55,368][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:52:56,024][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:52:56,682][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:52:57,339][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:52:57,996][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:52:58,653][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:52:59,310][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:52:59,967][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:53:00,624][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:53:01,281][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:53:01,938][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:53:02,595][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:53:03,252][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:53:03,910][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:53:04,567][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:53:05,224][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:53:05,881][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:53:06,538][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:53:07,196][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:53:07,853][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:53:08,510][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:53:09,168][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:53:09,825][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:53:10,483][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:53:11,140][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:53:11,798][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:53:12,455][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:53:13,112][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:53:13,770][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:53:14,428][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:53:15,085][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:53:15,743][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:53:16,400][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:53:17,058][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:53:17,716][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:53:18,619][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:53:19,276][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:53:19,934][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:53:20,591][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:53:21,248][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:53:21,906][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:53:22,563][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:53:23,220][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:53:23,878][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:53:24,535][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:53:25,193][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:53:25,850][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:53:26,508][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:53:27,165][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:53:27,822][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:53:28,480][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:53:29,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:53:29,834][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:53:31,164][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:53:31,166][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:53:31,168][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:53:32,615][__main__][INFO] - Iteration 774 took 53s (12.18% Gen, 85.12% Train). Generation: 6s, Training: 45s. Estimated remaining time: 3h 6m 44s. Estimated total time: 14h 54m 11s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 25s, 500 more iterations: 7h 27m 5s. [2026-03-26 01:53:32,618][__main__][INFO] - Starting iteration 774. [2026-03-26 01:53:32,622][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:53:32,623][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:53:38,365][__main__][INFO] - Number of regex retries in iteration 774: 0 [2026-03-26 01:53:38,366][__main__][INFO] - agents played in iteration 774 are Bob, Alice [2026-03-26 01:53:38,977][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:53:39,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:53:39,039][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:53:39,040][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:53:39,754][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:53:40,368][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:53:41,026][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:53:41,683][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:53:42,340][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:53:42,997][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:53:43,654][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:53:44,310][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:53:44,968][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:53:45,625][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:53:46,282][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:53:46,940][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:53:47,597][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:53:48,254][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:53:48,912][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:53:49,569][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:53:50,226][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:53:50,883][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:53:51,540][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:53:52,197][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:53:52,855][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:53:53,512][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:53:54,169][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:53:54,826][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:53:55,483][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:53:56,140][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:53:56,798][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:53:57,455][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:53:58,113][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:53:58,771][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:53:59,428][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:54:00,085][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:54:00,743][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:54:01,400][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:54:02,057][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:54:02,714][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:54:03,372][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:54:04,029][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:54:04,687][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:54:05,344][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:54:06,001][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:54:06,659][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:54:07,316][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:54:07,973][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:54:08,630][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:54:09,287][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:54:09,945][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:54:10,602][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:54:11,546][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:54:12,204][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:54:12,861][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:54:13,519][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:54:14,176][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:54:14,833][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:54:15,491][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:54:16,148][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:54:16,805][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:54:17,463][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:54:18,120][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:54:18,778][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:54:19,435][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:54:20,092][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:54:20,750][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:54:21,407][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:54:22,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:54:22,781][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:54:24,177][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:54:24,180][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:54:24,181][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:54:25,546][__main__][INFO] - Iteration 775 took 52s (10.85% Gen, 86.57% Train). Generation: 5s, Training: 45s. Estimated remaining time: 2h 53m 46s. Estimated total time: 14h 42m 6s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 12s, 500 more iterations: 7h 21m 3s. [2026-03-26 01:54:25,549][__main__][INFO] - Starting iteration 775. [2026-03-26 01:54:25,553][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:54:25,553][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:54:32,378][__main__][INFO] - Number of regex retries in iteration 775: 0 [2026-03-26 01:54:32,379][__main__][INFO] - agents played in iteration 775 are Bob, Alice [2026-03-26 01:54:32,887][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:54:32,947][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:54:32,948][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:54:32,949][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:54:33,612][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:54:34,226][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:54:34,884][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:54:35,541][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:54:36,198][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:54:36,854][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:54:37,511][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:54:38,168][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:54:38,825][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:54:39,482][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:54:40,139][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:54:40,796][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:54:41,452][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:54:42,110][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:54:42,767][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:54:43,424][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:54:44,081][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:54:44,738][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:54:45,395][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:54:46,052][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:54:46,709][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:54:47,367][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:54:48,024][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:54:48,685][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:54:49,343][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:54:50,002][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:54:50,661][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:54:51,319][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:54:51,977][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:54:52,636][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:54:53,294][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:54:53,952][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:54:54,611][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:54:55,269][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:54:55,928][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:54:56,586][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:54:57,244][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:54:57,903][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:54:58,562][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:54:59,223][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:54:59,882][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:55:00,541][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:55:01,200][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:55:01,858][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:55:02,516][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:55:03,175][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:55:03,833][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:55:04,492][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:55:05,398][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:55:06,058][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:55:06,716][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:55:07,376][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:55:08,035][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:55:08,693][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:55:09,352][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:55:10,011][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:55:10,669][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:55:11,327][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:55:11,985][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:55:12,644][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:55:13,303][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:55:13,961][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:55:14,620][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:55:15,279][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:55:15,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:55:16,642][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:55:18,023][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:55:18,025][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:55:18,027][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:55:19,374][__main__][INFO] - Iteration 776 took 53s (12.68% Gen, 84.81% Train). Generation: 6s, Training: 45s. Estimated remaining time: 3h 7m 48s. Estimated total time: 14h 57m 2s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 42s, 500 more iterations: 7h 28m 31s. [2026-03-26 01:55:19,376][__main__][INFO] - Starting iteration 776. [2026-03-26 01:55:19,379][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:55:19,380][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:55:24,587][__main__][INFO] - Number of regex retries in iteration 776: 0 [2026-03-26 01:55:24,588][__main__][INFO] - agents played in iteration 776 are Bob, Alice [2026-03-26 01:55:25,195][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:55:25,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:55:25,258][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:55:25,259][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:55:25,927][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:55:26,549][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:55:27,210][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:55:27,868][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:55:28,526][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:55:29,184][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:55:29,842][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:55:30,500][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:55:31,159][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:55:31,817][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:55:32,475][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:55:33,133][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:55:33,791][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:55:34,449][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:55:35,108][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:55:35,766][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:55:36,424][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:55:37,082][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:55:37,740][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:55:38,398][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:55:39,057][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:55:39,715][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:55:40,373][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:55:41,031][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:55:41,690][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:55:42,348][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:55:43,006][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:55:43,664][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:55:44,322][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:55:44,981][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:55:45,639][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:55:46,297][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:55:46,956][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:55:47,614][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:55:48,272][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:55:48,931][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:55:49,589][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:55:50,247][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:55:50,906][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:55:51,564][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:55:52,222][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:55:52,880][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:55:53,539][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:55:54,197][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:55:54,855][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:55:55,514][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:55:56,172][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:55:56,830][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:55:57,782][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:55:58,440][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:55:59,099][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:55:59,756][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:56:00,413][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:56:01,070][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:56:01,727][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:56:02,385][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:56:03,043][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:56:03,700][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:56:04,357][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:56:05,015][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:56:05,672][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:56:06,329][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:56:06,987][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:56:07,644][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:56:08,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:56:09,025][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:56:10,347][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:56:10,350][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:56:10,351][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:56:11,906][__main__][INFO] - Iteration 777 took 52s (9.92% Gen, 87.12% Train). Generation: 5s, Training: 45s. Estimated remaining time: 2h 45m 21s. Estimated total time: 14h 35m 28s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 32s, 500 more iterations: 7h 17m 44s. [2026-03-26 01:56:11,908][__main__][INFO] - Starting iteration 777. [2026-03-26 01:56:11,913][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:56:11,913][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:56:18,413][__main__][INFO] - Number of regex retries in iteration 777: 0 [2026-03-26 01:56:18,415][__main__][INFO] - agents played in iteration 777 are Bob, Alice [2026-03-26 01:56:19,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:56:19,502][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:56:19,503][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:56:19,504][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:56:20,184][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:56:20,803][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:56:21,460][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:56:22,116][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:56:22,774][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:56:23,431][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:56:24,088][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:56:24,745][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:56:25,402][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:56:26,059][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:56:26,717][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:56:27,374][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:56:28,031][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:56:28,688][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:56:29,345][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:56:30,002][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:56:30,660][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:56:31,317][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:56:31,974][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:56:32,631][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:56:33,289][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:56:33,946][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:56:34,603][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:56:35,261][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:56:35,918][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:56:36,575][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:56:37,233][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:56:37,890][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:56:38,548][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:56:39,205][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:56:39,863][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:56:40,521][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:56:41,178][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:56:41,835][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:56:42,492][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:56:43,150][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:56:43,807][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:56:44,464][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:56:45,121][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:56:45,779][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:56:46,436][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:56:47,093][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:56:47,750][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:56:48,408][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:56:49,065][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:56:49,722][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:56:50,380][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:56:51,037][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:56:51,935][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:56:52,593][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:56:53,251][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:56:53,908][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:56:54,565][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:56:55,222][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:56:55,880][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:56:56,538][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:56:57,195][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:56:57,852][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:56:58,511][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:56:59,168][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:56:59,826][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:57:00,483][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:57:01,141][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:57:01,799][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:57:02,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:57:03,171][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:42 [2026-03-26 01:57:04,558][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:57:04,561][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:57:04,563][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:57:05,856][__main__][INFO] - Iteration 778 took 53s (12.05% Gen, 85.55% Train). Generation: 6s, Training: 46s. Estimated remaining time: 3h 8m 4s. Estimated total time: 14h 59m 5s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 54s, 500 more iterations: 7h 29m 32s. [2026-03-26 01:57:05,859][__main__][INFO] - Starting iteration 778. [2026-03-26 01:57:05,864][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:57:05,865][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:57:11,998][__main__][INFO] - Number of regex retries in iteration 778: 0 [2026-03-26 01:57:11,999][__main__][INFO] - agents played in iteration 778 are Bob, Alice [2026-03-26 01:57:13,181][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:57:13,242][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:57:13,243][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:57:13,243][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:57:13,901][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:57:14,538][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:57:15,177][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:57:15,833][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:57:16,490][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:57:17,147][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:57:17,804][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:57:18,461][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:57:19,118][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:57:19,775][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:57:20,432][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:57:21,089][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:57:21,746][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:57:22,403][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:57:23,060][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:57:23,717][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:57:24,374][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:57:25,031][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:57:25,688][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:57:26,345][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:57:27,003][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:57:27,660][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:57:28,317][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:57:28,975][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:57:29,632][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:57:30,290][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:57:30,947][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:57:31,604][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:57:32,261][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:57:32,919][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:57:33,576][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:57:34,233][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:57:34,891][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:57:35,549][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:57:36,206][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:57:36,863][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:57:37,521][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:57:38,178][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:57:38,836][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:57:39,494][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:57:40,151][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:57:40,808][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:57:41,467][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:57:42,124][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:57:42,781][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:57:43,439][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:57:44,096][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:57:44,753][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:57:45,696][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:57:46,354][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:57:47,011][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:57:47,668][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:57:48,326][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:57:48,983][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:57:49,640][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:57:50,299][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:57:50,956][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:57:51,614][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:57:52,272][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:57:52,929][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:57:53,586][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:57:54,244][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:57:54,902][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:57:55,559][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:57:56,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:57:56,939][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:57:58,254][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:57:58,257][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:57:58,258][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:57:59,647][__main__][INFO] - Iteration 779 took 53s (11.41% Gen, 86.01% Train). Generation: 6s, Training: 46s. Estimated remaining time: 3h 4m 30s. Estimated total time: 14h 56m 24s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 38s, 500 more iterations: 7h 28m 12s. [2026-03-26 01:57:59,649][__main__][INFO] - Starting iteration 779. [2026-03-26 01:57:59,654][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:57:59,654][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:58:05,676][__main__][INFO] - Number of regex retries in iteration 779: 0 [2026-03-26 01:58:05,677][__main__][INFO] - agents played in iteration 779 are Bob, Alice [2026-03-26 01:58:06,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:58:06,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:58:06,363][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:58:06,363][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:58:07,040][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:58:07,644][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:58:08,302][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:58:08,959][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:58:09,616][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:58:10,274][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:58:10,930][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:58:11,588][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:58:12,245][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:58:12,902][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:58:13,559][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:58:14,216][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:58:14,873][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:58:15,531][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:58:16,188][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:58:16,845][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:58:17,502][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:58:18,160][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:58:18,817][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:58:19,474][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:58:20,132][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:58:20,789][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:58:21,446][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:58:22,104][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:58:22,761][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:58:23,418][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:58:24,076][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:58:24,734][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:58:25,391][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:58:26,048][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:58:26,706][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:58:27,363][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:58:28,021][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:58:28,678][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:58:29,336][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:58:29,994][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:58:30,651][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:58:31,308][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:58:31,966][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:58:32,623][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:58:33,281][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:58:33,938][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:58:34,596][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:58:35,255][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:58:35,913][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:58:36,572][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:58:37,231][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:58:37,889][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:58:38,859][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:58:39,518][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:58:40,175][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:58:40,833][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:58:41,491][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:58:42,149][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:58:42,807][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:58:43,465][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:58:44,123][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:58:44,782][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:58:45,440][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:58:46,097][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:58:46,757][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:58:47,415][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:58:48,072][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:58:48,730][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:58:49,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:58:50,185][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:58:51,651][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:58:51,655][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:58:51,689][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:58:53,391][__main__][INFO] - Iteration 780 took 53s (11.21% Gen, 85.62% Train). Generation: 6s, Training: 46s. Estimated remaining time: 3h 2m 51s. Estimated total time: 14h 55m 39s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 33s, 500 more iterations: 7h 27m 49s. [2026-03-26 01:58:53,394][__main__][INFO] - Starting iteration 780. [2026-03-26 01:58:53,399][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:58:53,400][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:58:59,229][__main__][INFO] - Number of regex retries in iteration 780: 0 [2026-03-26 01:58:59,231][__main__][INFO] - agents played in iteration 780 are Bob, Alice [2026-03-26 01:58:59,729][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:58:59,790][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:58:59,791][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:58:59,792][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:59:00,467][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:59:01,081][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:59:01,741][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:59:02,399][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:59:03,057][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:59:03,715][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:59:04,373][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:59:05,031][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:59:05,689][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:59:06,347][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:59:07,006][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:59:07,664][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:59:08,322][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:59:08,981][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:59:09,639][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:59:10,297][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:59:10,956][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:59:11,615][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:59:12,274][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:59:12,932][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:59:13,591][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:59:14,249][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:59:14,907][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:59:15,565][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:59:16,224][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:59:16,882][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:59:17,540][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:59:18,199][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:59:18,857][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:59:19,515][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:59:20,174][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:59:20,832][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:59:21,491][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:59:22,149][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:59:22,808][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:59:23,467][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:59:24,126][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:59:24,784][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:59:25,443][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:59:26,102][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:59:26,760][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:59:27,419][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:59:28,078][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:59:28,738][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:59:29,396][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:59:30,054][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:59:30,713][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:59:31,372][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:59:32,366][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:59:33,024][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:59:33,682][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:59:34,339][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:59:34,996][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:59:35,654][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:59:36,311][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:59:36,968][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:59:37,626][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:59:38,283][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:59:38,941][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:59:39,598][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:59:40,256][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:59:40,914][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:59:41,571][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:59:42,228][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:59:42,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:59:43,581][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 01:59:44,892][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:59:44,895][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:59:44,897][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:59:46,263][__main__][INFO] - Iteration 781 took 52s (11.03% Gen, 86.38% Train). Generation: 5s, Training: 45s. Estimated remaining time: 2h 47m 26s. Estimated total time: 14h 41m 6s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 6s, 500 more iterations: 7h 20m 33s. [2026-03-26 01:59:46,265][__main__][INFO] - Starting iteration 781. [2026-03-26 01:59:46,270][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 01:59:46,271][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:59:51,363][__main__][INFO] - Number of regex retries in iteration 781: 0 [2026-03-26 01:59:51,365][__main__][INFO] - agents played in iteration 781 are Bob, Alice [2026-03-26 01:59:51,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:59:52,056][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 01:59:52,057][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:59:52,057][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:59:52,723][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:59:53,337][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:59:53,997][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:59:54,655][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:59:55,313][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:59:55,971][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:59:56,630][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:59:57,288][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:59:57,946][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:59:58,604][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:59:59,262][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:59:59,920][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:00:00,582][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:00:01,237][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:00:01,896][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:00:02,554][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:00:03,212][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:00:03,870][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:00:04,529][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:00:05,187][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:00:05,846][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:00:06,504][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:00:07,163][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:00:07,822][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:00:08,480][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:00:09,139][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:00:09,798][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:00:10,456][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:00:11,116][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:00:11,773][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:00:12,432][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:00:13,090][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:00:13,749][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:00:14,407][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:00:15,066][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:00:15,724][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:00:16,383][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:00:17,041][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:00:17,699][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:00:18,358][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:00:19,017][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:00:19,675][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:00:20,333][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:00:20,992][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:00:21,650][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:00:22,310][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:00:22,968][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:00:23,627][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:00:24,612][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:00:25,270][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:00:25,927][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:00:26,585][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:00:27,242][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:00:27,899][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:00:28,558][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:00:29,215][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:00:29,873][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:00:30,531][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:00:31,188][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:00:31,846][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:00:32,504][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:00:33,161][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:00:33,819][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:00:34,476][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:00:35,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:00:35,856][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:00:37,185][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:00:37,188][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:00:37,189][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:00:38,584][__main__][INFO] - Iteration 782 took 52s (9.74% Gen, 87.59% Train). Generation: 5s, Training: 45s. Estimated remaining time: 2h 37m 23s. Estimated total time: 14h 31m 56s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 11s, 500 more iterations: 7h 15m 58s. [2026-03-26 02:00:38,586][__main__][INFO] - Starting iteration 782. [2026-03-26 02:00:38,591][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 02:00:38,591][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:00:43,331][__main__][INFO] - Number of regex retries in iteration 782: 0 [2026-03-26 02:00:43,332][__main__][INFO] - agents played in iteration 782 are Bob, Alice [2026-03-26 02:00:43,895][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:00:43,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:00:43,957][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:00:43,958][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:00:44,646][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:00:45,262][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:00:45,921][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:00:46,577][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:00:47,234][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:00:47,891][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:00:48,548][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:00:49,204][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:00:49,861][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:00:50,518][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:00:51,175][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:00:51,832][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:00:52,490][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:00:53,147][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:00:53,805][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:00:54,462][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:00:55,120][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:00:55,777][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:00:56,434][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:00:57,092][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:00:57,749][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:00:58,406][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:00:59,064][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:00:59,721][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:01:00,379][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:01:01,037][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:01:01,694][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:01:02,352][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:01:03,009][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:01:03,666][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:01:04,323][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:01:04,981][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:01:05,638][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:01:06,295][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:01:06,952][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:01:07,610][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:01:08,267][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:01:08,924][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:01:09,582][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:01:10,240][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:01:10,897][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:01:11,554][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:01:12,211][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:01:12,869][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:01:13,527][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:01:14,184][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:01:14,841][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:01:15,499][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:01:16,414][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:01:17,071][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:01:17,728][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:01:18,386][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:01:19,043][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:01:19,700][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:01:20,358][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:01:21,015][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:01:21,672][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:01:22,330][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:01:22,991][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:01:23,646][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:01:24,304][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:01:24,961][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:01:25,619][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:01:26,276][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:01:26,934][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:01:27,671][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:01:29,009][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:01:29,012][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:01:29,014][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:01:30,474][__main__][INFO] - Iteration 783 took 51s (9.14% Gen, 88.04% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 29m 20s. Estimated total time: 14h 24m 45s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 28s, 500 more iterations: 7h 12m 22s. [2026-03-26 02:01:30,477][__main__][INFO] - Starting iteration 783. [2026-03-26 02:01:30,482][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 02:01:30,483][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:01:35,347][__main__][INFO] - Number of regex retries in iteration 783: 0 [2026-03-26 02:01:35,348][__main__][INFO] - agents played in iteration 783 are Bob, Alice [2026-03-26 02:01:35,965][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:01:36,026][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:01:36,027][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:01:36,028][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:01:36,699][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:01:37,324][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:01:37,982][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:01:38,640][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:01:39,299][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:01:39,957][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:01:40,616][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:01:41,274][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:01:41,932][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:01:42,591][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:01:43,250][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:01:43,908][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:01:44,567][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:01:45,225][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:01:45,883][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:01:46,541][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:01:47,199][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:01:47,857][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:01:48,515][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:01:49,173][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:01:49,831][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:01:50,489][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:01:51,147][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:01:51,806][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:01:52,465][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:01:53,123][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:01:53,781][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:01:54,439][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:01:55,098][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:01:55,757][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:01:56,416][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:01:57,075][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:01:57,734][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:01:58,392][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:01:59,051][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:01:59,709][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:02:00,367][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:02:01,026][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:02:01,684][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:02:02,342][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:02:03,000][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:02:03,658][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:02:04,316][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:02:04,975][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:02:05,634][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:02:06,292][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:02:06,950][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:02:07,608][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:02:08,528][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:02:09,186][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:02:09,843][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:02:10,501][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:02:11,158][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:02:11,815][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:02:12,473][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:02:13,130][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:02:13,787][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:02:14,445][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:02:15,102][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:02:15,759][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:02:16,417][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:02:17,075][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:02:17,732][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:02:18,390][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:02:19,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:02:19,754][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:02:21,111][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:02:21,114][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:02:21,115][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:02:22,650][__main__][INFO] - Iteration 784 took 52s (9.33% Gen, 87.73% Train). Generation: 4s, Training: 45s. Estimated remaining time: 2h 33m 13s. Estimated total time: 14h 29m 30s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 57s, 500 more iterations: 7h 14m 45s. [2026-03-26 02:02:22,652][__main__][INFO] - Starting iteration 784. [2026-03-26 02:02:22,659][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 02:02:22,659][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:02:28,788][__main__][INFO] - Number of regex retries in iteration 784: 0 [2026-03-26 02:02:28,789][__main__][INFO] - agents played in iteration 784 are Bob, Alice [2026-03-26 02:02:29,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:02:29,806][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:02:29,807][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:02:29,808][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:02:30,487][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:02:31,125][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:02:31,783][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:02:32,439][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:02:33,097][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:02:33,754][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:02:34,411][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:02:35,068][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:02:35,725][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:02:36,456][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:02:37,112][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:02:37,769][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:02:38,426][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:02:39,083][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:02:39,741][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:02:40,398][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:02:41,055][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:02:41,712][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:02:42,370][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:02:43,027][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:02:43,685][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:02:44,341][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:02:44,998][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:02:45,656][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:02:46,312][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:02:46,969][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:02:47,627][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:02:48,284][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:02:48,941][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:02:49,598][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:02:50,255][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:02:50,912][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:02:51,570][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:02:52,227][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:02:52,884][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:02:53,541][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:02:54,199][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:02:54,856][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:02:55,513][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:02:56,170][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:02:56,827][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:02:57,485][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:02:58,142][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:02:58,800][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:02:59,457][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:03:00,114][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:03:00,772][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:03:01,429][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:03:02,398][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:03:03,057][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:03:03,714][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:03:04,371][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:03:05,028][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:03:05,686][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:03:06,343][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:03:07,000][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:03:07,658][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:03:08,315][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:03:08,972][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:03:09,630][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:03:10,288][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:03:10,945][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:03:11,603][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:03:12,260][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:03:12,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:03:13,660][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:03:15,045][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:03:15,048][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:03:15,049][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:03:16,538][__main__][INFO] - Iteration 785 took 53s (11.38% Gen, 85.86% Train). Generation: 6s, Training: 46s. Estimated remaining time: 3h 0m 49s. Estimated total time: 14h 58m 1s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 48s, 500 more iterations: 7h 29m 0s. [2026-03-26 02:03:16,547][__main__][INFO] - Starting iteration 785. [2026-03-26 02:03:16,559][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 02:03:16,560][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:03:22,751][__main__][INFO] - Number of regex retries in iteration 785: 0 [2026-03-26 02:03:22,752][__main__][INFO] - agents played in iteration 785 are Bob, Alice [2026-03-26 02:03:23,955][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:03:24,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:03:24,022][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:03:24,023][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:03:24,702][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:03:25,313][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:03:25,971][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:03:26,627][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:03:27,284][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:03:27,941][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:03:28,598][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:03:29,255][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:03:29,911][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:03:30,568][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:03:31,225][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:03:31,882][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:03:32,540][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:03:33,196][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:03:33,853][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:03:34,511][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:03:35,168][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:03:35,826][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:03:36,483][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:03:37,140][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:03:37,797][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:03:38,455][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:03:39,112][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:03:39,770][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:03:40,428][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:03:41,085][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:03:41,741][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:03:42,399][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:03:43,056][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:03:43,713][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:03:44,371][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:03:45,028][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:03:45,685][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:03:46,343][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:03:47,000][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:03:47,657][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:03:48,315][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:03:48,972][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:03:49,629][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:03:50,286][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:03:50,945][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:03:51,602][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:03:52,260][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:03:52,918][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:03:53,575][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:03:54,233][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:03:54,890][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:03:55,547][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:03:56,446][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:03:57,104][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:03:57,762][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:03:58,420][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:03:59,078][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:03:59,736][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:04:00,393][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:04:01,051][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:04:01,709][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:04:02,367][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:04:03,025][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:04:03,684][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:04:04,342][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:04:04,999][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:04:05,658][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:04:06,315][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:04:06,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:04:07,726][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:43 [2026-03-26 02:04:09,222][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:04:09,225][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:04:09,258][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:04:11,051][__main__][INFO] - Iteration 786 took 54s (11.36% Gen, 85.34% Train). Generation: 6s, Training: 46s. Estimated remaining time: 3h 10m 7s. Estimated total time: 15h 8m 13s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 49s, 500 more iterations: 7h 34m 6s. [2026-03-26 02:04:11,053][__main__][INFO] - Starting iteration 786. [2026-03-26 02:04:11,057][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 02:04:11,058][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:04:16,789][__main__][INFO] - Number of regex retries in iteration 786: 0 [2026-03-26 02:04:16,790][__main__][INFO] - agents played in iteration 786 are Bob, Alice [2026-03-26 02:04:17,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:04:17,484][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:04:17,485][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:04:17,486][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:04:18,143][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:04:18,767][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:04:19,426][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:04:20,084][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:04:20,742][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:04:21,400][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:04:22,058][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:04:22,715][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:04:23,373][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:04:24,031][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:04:24,689][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:04:25,347][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:04:26,005][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:04:26,664][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:04:27,321][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:04:27,979][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:04:28,638][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:04:29,296][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:04:29,954][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:04:30,611][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:04:31,270][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:04:31,928][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:04:35,654][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:04:36,326][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:04:36,983][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:04:37,640][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:04:38,297][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:04:38,954][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:04:39,611][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:04:40,269][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:04:40,926][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:04:41,583][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:04:42,240][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:04:42,897][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:04:43,554][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:04:44,211][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:04:44,869][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:04:45,526][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:04:46,183][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:04:46,840][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:04:47,497][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:04:48,155][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:04:48,813][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:04:49,470][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:04:50,128][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:04:50,785][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:04:51,442][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:04:52,100][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:04:53,134][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:04:53,793][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:04:54,450][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:04:55,108][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:04:55,768][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:04:56,425][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:04:57,082][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:04:57,740][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:04:58,398][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:04:59,055][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:04:59,713][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:05:00,370][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:05:01,027][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:05:01,685][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:05:02,342][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:05:03,000][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:05:03,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:05:04,359][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.93%, Block Peak % of device VRAM: 25.90%, ΔTime: 00:00:46 [2026-03-26 02:05:05,693][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:05:05,696][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:05:05,697][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer/seed_0/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:05:07,059][__main__][INFO] - Iteration 787 took 56s (10.24% Gen, 87.33% Train). Generation: 5s, Training: 48s. Estimated remaining time: 3h 34m 21s. Estimated total time: 15h 33m 23s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 20s, 500 more iterations: 7h 46m 41s. [2026-03-26 02:05:07,061][__main__][INFO] - Starting iteration 787. [2026-03-26 02:05:07,065][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 02:05:07,066][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:05:11,979][__main__][INFO] - Number of regex retries in iteration 787: 0 [2026-03-26 02:05:11,981][__main__][INFO] - agents played in iteration 787 are Bob, Alice [2026-03-26 02:05:12,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:05:12,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.68%, Block Peak % of device VRAM: 19.45%, ΔTime: 00:00:00 [2026-03-26 02:05:12,551][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:05:12,552][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:05:13,243][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:05:13,861][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:05:14,519][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:05:15,176][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:05:15,833][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:05:16,490][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:05:17,147][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:05:17,804][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:05:18,461][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:05:19,118][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:05:19,775][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:05:20,432][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:05:21,089][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:05:21,746][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:05:22,404][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:05:23,061][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:05:23,718][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:05:24,375][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:05:25,032][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:05:25,689][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:05:26,347][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:05:27,004][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:05:27,661][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:05:28,320][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:05:28,977][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:05:29,634][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:05:30,292][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:05:30,949][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:05:31,607][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:05:32,264][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:05:32,922][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:05:33,579][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:05:34,238][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256